PREFACE

This report has been prepared with the aid of an on-line graphical
text editor deficient in its lack of underlining. UPPER CASE is used
throughout in place of underlining. We hope that this causes no
confusion.

References  in  the  text  are  cited  by  author  name(s)  and  year,  and  are 
given  in  Chapter  8.  References  cited  with  an  "A2"  instead  of  the  year 
refer to system descriptions contained in Appendix 2. For example,
"(Wensley 72)" refers to a reference provided in Chapter 8, while
"(Wensley A2)" refers to the description of a particular system found
in Appendix 2. Further references are also found in each of the
appendices. 

In  the  light  of  the  existence  of  several  extremely  comprehensive 
bibliographies in the area of fault-tolerant computing (cited at the
beginning of Chapter 8), we have chosen to be selective in our
references. Where a multiplicity of references is relevant, we have
sometimes chosen to cite only the most recent ones, so that the
interested reader can pursue earlier references by indirection.

A STUDY OF FAULT-TOLERANT COMPUTING: FINAL TECHNICAL REPORT

Peter G. Neumann, Jack Goldberg, Karl K. Levitt and John H. Wensley
Computer Science Group, Stanford Research Institute, Menlo Park, CA

CHAPTER 1. SUMMARY OF THIS REPORT

This  report  presents  the  results  of  a  study  of  the  state  of  the  art  of 
designing  fault-tolerant  computing  systems.  This  chapter  provides  a 
summary  of  the  technical  problem,  the  technical  results,  the  relevance 
of  the  study  to  the  Department  of  Defense,  to  users  and  to  vendors,  and 
implications for future research and development.

1.1.  THE  TECHNICAL  PROBLEM 

The purpose of this study is:

*  To  survey  and  evaluate  existing  systems,  system  concepts,  and  relevant 
existing  theory,  in  order  to  assess  the  art  of  designing  effective  and 
economical  fault-tolerant  systems. 

*  To  define  and  evaluate  new  approaches  to  the  design  of  computing 
systems  with  improved  fault  tolerance. 

The  system  goals  of  interest  include  the  attainment  of: 

* CORRECTNESS -- High degrees of correct operation despite the occurrence
of faults in hardware.

* AVAILABILITY -- Very high system availability (i.e., very little
down-time), with little or no emergency maintenance and possibly very
little maintenance at all.

* RECOVERY -- Rapid recovery from faults not immediately tolerated, with
limited but known (and usually recoverable) losses.

* ECONOMY -- Low redundancy relative to system replication, and low cost
of fault tolerance relative to the total environment in which the
computer system exists.

System goals are considered that might require massive equipment
redundancy (e.g., for extremely high correctness, or very long
zero-maintenance lifetime, or extremely fast recovery), but these goals
are not of primary interest here.

1.2.  TECHNICAL  RESULTS 

The  basic  conclusion  of  this  study  is  that  substantial  fault  tolerance 
can  be  achieved  at  surprisingly  low  cost  (in  both  hardware  and  software) 
under  a  wide  range  of  operating  requirements.  The  fault  tolerance 
attainable  with  the  present  state  of  the  art  is  much  greater  than  in 
present  systems,  and  satisfies  many  present  demands.  Further 
improvements  are  also  possible  that  would  allow  design  of  still  more 
powerful  systems.  This  basic  conclusion  is  especially  applicable  to 
large  systems  with  flexible  real-time  requirements.  Such  systems 
include general-purpose systems, message store-and-forward systems,
communications  processors,  and  networks. 

In  Chapters  3,  4,  and  5  of  this  report,  we  review  many  techniques  for 
fault  tolerance.  These  techniques  facilitate  the  detection,  isolation, 
location,  and  removal  of  errors,  and  the  recovery  from  the  effects  of 
these  errors.  In  Section  3.3,  system  architectures  are  considered, 
including  a  variety  of  simplex  and  multiprocessor  configurations.  In 
Chapter  6,  the  techniques  for  fault  tolerance  are  applied  to  these 
architectures.  Quantitative  measures  of  system  correctness, 
availability,  recovery,  and  cost  are  given  for  each  of  these 
architectures. 

We find no fundamental gaps in the state of the hardware design art
preventing the attainment of high degrees of fault tolerance at low cost
relative to the overall system -- except for questions of recovery speed
and very long unattended life discussed below. On the one hand, there
are applications in which the computer systems represent only a small
portion of the total costs, e.g., in special-purpose control
applications. In these cases, the extreme solution of replication of
computer equipment with comparison or voting may be economical overall.
In most existing systems and system designs for such applications, which
provide guaranteed fault tolerance for essentially all single faults,
anywhere from 60 to 80 percent of the hardware is typically devoted to
fault tolerance. On the other hand, we find in general that 10 to 40
percent of the hardware devoted to fault tolerance is sufficient to
achieve adequate correctness and availability for many systems, except
when all system results have highly critical real-time requirements on
correct performance. Such low redundancy can be achieved by a
combination of techniques (both existing and newly developing), and by
careful use of structure in the system. Such structure facilitates
taking advantage of the nonuniformity of internal system requirements,
and permits various fault-tolerance techniques to be used when and where
they are most effective, rather than uniformly. The resulting
partitioning makes complete single-fault tolerance necessary only within
certain critical partitions. It also facilitates speedy recovery when
essential. Such structure also facilitates graceful degradation of
performance. 
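
As a minimal sketch of the "replication with comparison or voting"
approach mentioned above, the following Python fragment shows duplication
with comparison (which detects a disagreement but cannot say which copy is
wrong) and triplication with majority voting (which masks a single faulty
replica). The function names, and the representation of replica outputs
as a Python list of hashable values, are assumptions made for
illustration; the report does not prescribe an implementation.

```python
from collections import Counter


def compare(output_a, output_b):
    """Duplication with comparison: True when the two replicas agree."""
    return output_a == output_b


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas.

    Raises ValueError when no value has a strict majority, in which case
    the fault cannot be masked by voting alone.
    """
    if not outputs:
        raise ValueError("no replica outputs supplied")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


# Three replicas, one of which is faulty (hypothetical values).
voted = majority_vote([42, 42, 17])   # the faulty output 17 is masked
agree = compare(42, 17)               # the disagreement is only detected
```

The cost of this masking is visible in the hardware count: two of the
three replicas, plus the voter, exist only for fault tolerance.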

We do find fundamental gaps in the art of designing and implementing
software to support hardware facilities for fault tolerance. This art
is notably weak in the areas of specifying and verifying system designs
and implementations in a way that unifies hardware and software with
proper consideration of operational needs. This weakness is especially
evident with respect to the poor state of operating systems, and in the
inadequate coverage and resolution of system diagnostics. As noted above,
present systems and the present design art are seriously deficient in
the speed of recovery following faults. Solutions to this problem
require advances in hardware and software for improved diagnosability,
and in total hardware-software integration. The art is also weak in
maintaining smoothly degradable performance in a low-maintenance
environment.

In modern computer design, boundaries between hardware and software are
becoming  increasingly  diffuse.  There  is  a  real  need  to  upgrade  the 
fault-tolerance  design  art  so  that  it  can  become  a  standard  facet  of 
computer  design.  Significant  effort  is  required  in  system  software  and 
system  operations  to  assure  that  good  hardware  development  is  not 
compromised.  We  emphasize  the  critical  importance  of  developing 
balanced  system  designs  well  suited  to  particular  system  needs.  It  is 
helpful  if  system  goals  can  be  integrated  from  the  beginning,  rather 
than  retrofitting  fault  tolerance  into  a  system  not  designed  with  it  in 
mind.  To  this  end,  the  concept  of  system  structure  is  useful  in  all  the 
stages  of  system  development,  including  the  design,  implementation, 
verification,  use,  operation,  and  evolution  of  the  system.  Structured 
design  and  implementation  hold  great  promise  for  improving  the  art,  not 
just  for  fault  tolerance,  but  for  computer  system  design  in  general. 

In  many  cases,  high  availability  cannot  be  achieved  without  a  secure 
system  (employing  protection  mechanisms  in  the  hardware  and  operating 
system  to  assure  that  it  is  relatively  crash-proof).  In  turn,  such 
security  cannot  be  achieved  without  high  reliability,  especially  in  the 
portion  of  the  system  that  affects  security.  System  structure  can  also 
be  helpful  in  achieving  the  goals  of  security. 

Our  conclusions  also  include  implications  of  system  design  on  the 
operational  and  human  aspects,  which  play  a  critical  role  in  keeping  a 
system  highly  available.  These  include  not  only  considerations 
affecting  correctness,  continued  system  availability,  and  rapid 
recovery, but also those lessening the critical dependence on
administrators, skilled operators, and maintenance personnel.

We  conclude  that  the  attainment  of  a  sufficiently  fault-tolerant  system 
is  possible  for  various  particular  applications  at  relatively  low  cost. 
However,  considerable  care  and  common  sense  are  still  required  in  system 
implementation.  Our  survey  of  existing  systems  shows  that  seemingly 
obvious  measures  for  fault  avoidance  are  often  ignored.  If  the  reader 
occasionally  finds  a  statement  that  seems  obvious,  it  may  be  included 
here for completeness, or because it serves as a basis for subsequent
discussion, or because there are hidden difficulties in implementation.

We  pose  the  challenge  to  practitioners  and  theoreticians  of  fault 
tolerance  to  find  structures  and  theories  that  move  these  "obvious" 
design  decisions  from  the  domain  of  good  judgment  to  that  of  systematic 
practice. 

Many  of  the  techniques  discussed  in  this  report  are  useful  with 
present-day  technologies.  Others  are  particularly  suited  to  emerging 
technologies  such  as  LSI,  which  can  have  a  significant  effect  on  system 
fault tolerance, e.g., due to compactness, low heat generation, and low
cost.  These  latter  technologies  will  permit  the  use  of  techniques  not 
previously  practical.  However,  the  trend  to  high-density  systems  using 
advanced  technologies  (for  memory  as  well  as  processing)  will  not 
obviate  the  need  for  architectural  measures  to  achieve  fault  tolerance. 

It is true that the unit reliability of new LSI devices is not much less
than that of IC and MSI devices, while the devices are significantly more
powerful.  Thus,  a  given  function  may  be  realized  in  LSI  with  higher 
reliability  due  to  the  use  of  fewer  devices.  However,  while  the  number 
of  devices  per  function  decreases,  there  is  a  strong  tendency  for  large 
general-purpose  systems  to  grow,  up  to  the  limits  of  cost.  For  example, 
there  is  a  trend  toward  increasingly  powerful  hardware  in  order  to 
simplify  programming. 

An  additional  factor  is  the  limit  on  reliability  imposed  by  the  high 
cost  of  testing  a  device  to  an  assured  level  of  reliability.  While  a 
device  may  be  extremely  reliable,  the  system  designer  can  assume  only 
that  reliability  that  can  be  demonstrated.  The  current  practical  limit 
on testable failure rates for a device ranges from 10^[exponent illegible]
to 10^[exponent illegible] failures per hour. This implies that, for very
large systems, the projected system failure rate would be
10^[exponent illegible] to 10^[exponent illegible] failures per hour. This clearly
requires  system-level  fault-tolerance  measures. 
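
The scaling argument above can be illustrated with a short Python sketch.
It assumes a non-redundant system in which every device fails
independently at the same constant rate and any single device failure
fails the system, so that the system failure rate is the device rate
multiplied by the device count. The numerical values below are
hypothetical and are NOT the figures of this report; they serve only to
show the arithmetic.

```python
import math


def series_failure_rate(device_rate_per_hour, device_count):
    """Failure rate of a system that fails when any one device fails."""
    return device_rate_per_hour * device_count


def survival_probability(rate_per_hour, hours):
    """Probability of no failure in the interval, at a constant rate."""
    return math.exp(-rate_per_hour * hours)


# Hypothetical figures, NOT taken from the report.
DEVICE_RATE = 1e-7       # failures per hour per device (assumed)
DEVICE_COUNT = 100_000   # devices in a very large system (assumed)

system_rate = series_failure_rate(DEVICE_RATE, DEVICE_COUNT)
one_day = survival_probability(system_rate, 24.0)
```

Under these assumed figures the system rate is five orders of magnitude
above the device rate, which is why device reliability alone cannot
substitute for system-level measures.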

The work reported here is novel in several respects. It represents both
theoretical and practical approaches to economical fault tolerance,
rather than the use of massive redundancy. It provides a framework for
a unified hardware/software approach to system design for fault
tolerance. It attempts to show explicit cost figures for fault tolerance
over a wide range of architectures. It also includes several new
theoretical results on reconfigurable memories and on coding for
arithmetic.

1.3. RELEVANCE OF THIS STUDY

This  work  is  applicable  to  many  kinds  of  computing  systems.  These 
include  systems  with  general-purpose  and/or  special-purpose  capability, 
and network control computers such as the interface message processors
(IMPs) in the ARPA network.

1.3.1.  RELEVANCE  TO  THE  DEPARTMENT  OF  DEFENSE 


Specific  conclusions  of  our  study  affecting  the  Department  of  Defense 
include  the  following. 

*  Significantly  better  fault  tolerance  (e.g.,  correct  behavior,  high 
availability,  rapid  recovery,  and  high  system  security)  can  be  obtained, 
even  in  the  presence  of  malfunctions. 

* Significantly more economical fault tolerance can be achieved, with
more  efficient  use  of  redundancy,  more  remote  diagnosis  and  maintenance, 
more  automatic  self-maintenance  (e.g.,  the  use  of  spares,  with  automatic 
reconfiguration),  and  less  emergency  maintenance.  More  automatic 
operation  will  result  in  reducing  the  unnecessary  reliance  on 
potentially  unreliable  people  in  critical  positions. 

*  While  the  primary  scope  of  this  report  involves  the  design  of  large 
general-purpose  systems,  there  is  considerable  potential  for 
applicability  to  tactical  and  other  real-time  control  systems. 

*  Significant  effort  must  be  expended  to  assure  overall  system 
reliability,  e.g.,  effort  concerning  good  software  design  and 
implementation, reliable operations personnel, and other system support.
Otherwise, good hardware design may be wasted. In addition, history
shows that computer manufacturers have been slow to respond to customer
needs that have not been clearly and forcefully enunciated. We feel
that if DOD wishes to have systems with economical fault tolerance, it
must  stimulate  manufacturers  to  develop  such  systems  by  defining  and 
enforcing  fault-tolerance  requirements  in  terms  of  realizable 
specifications. 

1.3.2.  RELEVANCE  TO  USER  COMMUNITIES 


The recommendations here generally make most of the mechanisms for
achieving high reliability and high availability invisible to system
users during system operation. However, user communities will have to
exhibit  greater  awareness  of  what  can  be  achieved  and  what  they  might 
require.  They  should  clearly  define  their  needs,  and  exhibit 
considerable  unity  in  presenting  these  needs  to  the  vendors. 


1.3.3. RELEVANCE TO THE VENDORS


In  recent  years  several  commercial  vendors  have  undertaken  serious 
effort toward achieving fault tolerance in computer systems, primarily
in the light of aerospace needs. Several useful steps forward have also
been taken in some recent commercial systems, such as the use of
error-correcting codes and instruction retry, and the use of
hierarchical recovery strategies. It is hoped that this report will be
helpful to all system development efforts in focusing attention on fault
tolerance as an integral part of system development, especially since
much  can  be  done  at  low  cost.  Some  of  the  techniques  herein  can  be 
retrofitted  onto  existing  systems.  However,  it  is  most  cost-effective 
to  integrate  fault  tolerance  into  the  overall  design. 

1.4. IMPLICATIONS FOR FUTURE RESEARCH AND DEVELOPMENT


While  our  basic  finding  is  that  the  present  art  provides  the  basis  for 
reliable  systems  at  reasonable  costs,  there  exist  limitations  which,  if 
overcome, could result in further significant improvements (e.g., by
reducing recovery time, by reducing the residual error rate, and by
further reducing the cost). In Chapter 7, we summarize some
recommendations for achieving such improvements. These include better
techniques for error detection and fault diagnosis, novel architectures
specifically suited to fault tolerance, and significantly improved
techniques for the analysis of fault-tolerant systems.

CHAPTER 2. INTRODUCTION


As used here, the term "fault tolerance" broadly means the
ability of a system to withstand various kinds of hardware malfunctions
and  mishaps.  There  are  varying  degrees  of  fault  tolerance,  including 
continued  correct  performance  for  some  portion  of  the  system,  and 
continued  availability  of  some  portion  of  the  system,  although  possibly 
with  degraded  capacity.  There  are  increasingly  many  applications 
requiring  much  better  fault  tolerance  than  is  currently  available. 

Those  of  interest  here  include  general-purpose  systems  with  both  batch 
and  interactive  capabilities,  as  well  as  various  special-purpose  systems 
such  as  message  switching  systems.  Our  emphasis  is  on  economical  fault 
tolerance  for  applications  with  varying  real-time  criticalities.  The 
work  is  also  relevant  to  various  aerospace  applications  that  are 
currently  approached  with  massive  redundancy. 

We are concerned primarily with system-level techniques for increasing
fault  tolerance,  rather  than  with  techniques  for  improving  the 
reliability  of  various  technologies.  Thus,  we  focus  largely  on  system 
architecture.  This  chapter  provides  an  introduction  to  the  report. 
Chapter  3  gives  a  guide  to  the  techniques  for  fault  tolerance  useful  at 
various  system  levels,  and  illustrates  their  applicability  to  system  and 
network  architecture.  Included  are  simplex  systems  and  multiprocessors 
(with  widely  varying  degrees  of  parallelism,  independence,  and  common 
information  access).  The  chapter  also  discusses  the  role  of  structure 
in the attainment of economical fault tolerance. Chapters 4 and 5
present some advances in architectural techniques for fault tolerance,
Chapter 4 considering memory, and Chapter 5 considering arithmetic,
logic, and control. Chapter 6 considers different application fields
(special-purpose, aerospace, communications, etc.) and presents the
special requirements for fault tolerance in each field. From these
special requirements, appropriate techniques and architectures are
derived, and their effectiveness considered. Chapter 7 provides the
conclusions of our study, along with specific recommendations for future
research and development.


Several appendices are included. Appendix 1 provides a census of
fault-tolerant systems. Appendix 2 provides a detailed survey of various
representative  systems.  Appendix  3  gives  substantially  greater  detail 
to  support  the  memory  organizations  of  Chapter  4.  Finally,  Appendix  4 
presents  some  new  results  on  byte  coding  for  arithmetic. 


2.1.  BASIC  DEFINITIONS  AND  ASSUMPTIONS 


In  this  section  we  present  definitions  of  the  basic  terms  associated 
with  fault-tolerant  systems.  In  addition,  we  present  a  few  assumptions 
that have guided us in the design approaches considered here.

FAULTS, ERRORS AND FAILURES

The  terms  "fault"  and  "error"  are  defined  with  respect  to  the  interface 
of  a  hardware  or  software  mechanism,  e.g.,  a  component  or  a  subsystem, 
whose  output  is  observable  at  least  to  some  other  mechanism.  An  ERROR 
is a disparity between the actual output at such an interface and the
value expected under normal operation. Examples are an incorrect result
from an arithmetic unit, an incorrect word in a memory unit, and an
incorrect word involving an input-output device. Errors may be SINGLE
or MULTIPLE, depending on their nature. For example, an additive error
in a single bit position of an adder could affect several bit positions
(with carries), and would appear to be a multiple error in memory.

Errors may be DETECTED or UNDETECTED at a particular interface. For
example, single memory errors are detected by simple parity checking in
memory, but double errors (or quadruple errors, etc.) are not. Errors
not  detected  at  one  interface  may  be  subsequently  detected  at  another 
(higher-level)  interface,  e.g.,  via  consistency  checks. 
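
The parity example above can be sketched in a few lines of Python. The
sketch assumes a single even-parity bit stored alongside each memory
word; the word value, the bit positions flipped, and the function names
are illustrative assumptions rather than details from this report.

```python
def parity_bit(word):
    """Even-parity bit of a non-negative integer: 1 if the count of 1s is odd."""
    return bin(word).count("1") % 2


def check(word, stored_parity):
    """True when the recomputed parity matches, i.e. no error is detected."""
    return parity_bit(word) == stored_parity


def flip_bits(word, positions):
    """Return the word with the given bit positions inverted."""
    for position in positions:
        word ^= 1 << position
    return word


original = 0b10110010            # hypothetical memory word
stored = parity_bit(original)    # parity written along with the word

single_error = flip_bits(original, [3])      # one bit in error
double_error = flip_bits(original, [3, 5])   # two bits in error

single_passes = check(single_error, stored)  # False: error detected
double_passes = check(double_error, stored)  # True: error goes undetected
```

Any even number of bit errors leaves the parity unchanged, which is why
such errors must be caught, if at all, at a higher-level interface.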

A FAULT is an internal malfunction within a mechanism. It may or may
not result in an observable error. This depends on the data that are
actually entered into the mechanism, whether or not the faulty part is
redundant, and whether or not the mechanism has internal fault-tolerance
capability. Faults may be transient, intermittent, or permanent. A
TRANSIENT fault is one that occurs once, leaving the hardware in a
fault-free condition, but with possible effects on the software and on
system operation. An INTERMITTENT fault is one that recurs, with
intervening fault-free periods. A PERMANENT fault is one that persists
steadily without interruption. In hardware a transient fault may become
intermittent, and an intermittent fault may become permanent. (The
terms "transient" and "intermittent" are often merged.) A transient
fault might be caused, for example, by interference on a bus. A
permanent fault might be due to a shorted transistor, shorted wires, an
open connection, or a power supply fluctuation, for example.

Faults are hardware phenomena, and are potential sources of system
errors. Other sources of system errors also exist, e.g., mistakes in
design, or misuse. Examples of potential sources of errors are found in
Table 2.1. (See also Yourdon 72.)

The mechanisms of transient faults are not so well understood as those
of permanent faults, for example, but several observations are relevant
here.

* In many technologies, transient and intermittent faults seem to
dominate permanent faults by at least an order of magnitude. This
dominance is partly because the nonpermanent faults are harder to find,
and thus are usually not found before they can recur. If they degenerate
to permanent faults, they usually become more readily identifiable.

* A major cause of errors is poor design, e.g., in not properly handling
the occurrences of exceptional cases (e.g., electrical disturbances).
Examples of such cases are undesirable circuit coupling that is data
dependent, unusual timing dependencies, and marginally designed power
supplies.

INDEPENDENCE. An important property of multiple faults and multiple
errors is their relative INDEPENDENCE or DEPENDENCE. In the case of an
LSI realization, a fault within a chip can result in multiple
(dependent) errors from that chip. Faults in different chips should be
considered as independent if adequate protection exists at chip
interfaces. For conventional core memories, core and line driver faults
seem to occur independently of one another. Thus, for each subsystem
there is a primitive element or a set of primitive elements to which
faults can be ascribed.

The  events  following  a  fault  are  summarized  in  Figure  2.1.  When  a  fault 
is  detected  (e.g.,  via  coding  or  duplication,  or  implicitly  by  fault 
masking), a recovery strategy is invoked. However, as long as a fault
remains undetected, the effects of the fault may propagate. It may even
be compounded by further (dependent or independent) faults or by being a
REPEATED-USE fault (Avizienis 72). In many cases faulty behavior is
ultimately  detected  (although  in  extreme  cases  perhaps  only  by 
complaints  following  a  system  crash),  at  which  point  recovery  is 
attempted. 

The possible effects of undetected errors are quite varied. There is a
wide range of effects of faults on system behavior. There are many
forms of "crashes", gradual or sudden, impairing in varying degrees
correctness,  availability,  performance  and  security.  However,  it  is  not 
necessary  that  all  errors  be  detected  in  all  situations.  For  example, 
in  a  time-sharing  environment  most  users  are  willing  to  accept 
occasional  errors  due  to  hardware  faults,  provided  either  they  or  the 
system  can  detect  the  errors,  and  provided  adequate  recovery  and  file 
integrity  are  available.  Users  are  normally  not  willing  to  accept 
frequent  crashes,  long  outages,  or  loss  of  on-line  files  maintained  by  a 
system  whose  intent  is  to  eliminate  the  need  for  private  backup.  In 
usage here, a FAILURE is an error whose effect is in some sense
critical. Various senses of "critical" are discussed in Section 3.2.2.

FIGURE 2.1 MODEL OF FAULTY BEHAVIOR

TABLE 2.1

SOURCES OF HARDWARE FAULTS

    Physical bonds and loose connectors
    Wear in moving parts
    Material aging
    Insulation breakdown
    Environmental effects (e.g., temperature, humidity,
        vibrations, electrical and electromagnetic disturbances)
    Human-induced breakage

OTHER SOURCES (HARD, SOFT, OPERATIONAL)

    Inadequate design and implementation:
        Lack of checking and validation in interfaces,
            especially in response to unanticipated conditions
        Sensitivity to timing variations
        Data dependency effects
    Usage-induced hardware damage
    Inadequate system security
    Inadequate system verification
    Acts of God (lightning, floods, etc.)
    People problems (e.g., administration, maintenance,
        concurrent development, operators, documentation)
    Power sources, local and public utilities
    Support functions (e.g., air conditioning)

RELIABILITY, CORRECTNESS, AVAILABILITY, AND FAULT TOLERANCE.

FAULT TOLERANCE is (roughly speaking) the ability of a system to
withstand faults. The significant effects under consideration here are
LOSS OF CORRECTNESS (e.g., as the result of errors in processing and in
storage -- the latter including damage to stored programs) and LOSS OF
AVAILABILITY (e.g., the loss of computing capacity or storage capacity,
or of response time). A violation of system security may lead to loss
of both correctness and availability, as well as loss of other aspects.
The absence of an expected output may also lead to the loss of
availability or correctness or both, depending on the application.
AVAILABILITY  is  thus  a  measure  of  having  operative  resources  that  can  be 
called  upon  to  handle  a  task.  CORRECTNESS  is  a  measure  of  how 
error-free  a  result  is  at  some  interface  of  interest.  The  term 
"RELIABILITY"  is  little  used  in  this  report  in  its  standard  meaning  of 
the  probability  of  correct  behavior  at  a  specific  time.  The  term 
"RELIABLE"  is  used  in  a  qualitative  sense  to  denote  correctness  and/or 
availability. 

As  a  means  for  evaluating  a  given  system,  it  is  desirable  to  derive 
quantitative  measures  of  correctness  and  availability.  The  classical 
measures "mean time to failure" and "mean time to repair" are not by
themselves  adequate  measures  for  most  complex  systems.  Better  measures 
are  probabilities  as  a  function  of  time  that  certain  resources  are 
available  and  that  certain  data  are  correct. 

It  is  readily  seen  that  a  wide  range  of  effects  is  possible.  For  a 
given  fault  (or  combination  of  faults),  these  effects  may  range  widely 
in  their  fault  tolerance  between  two  extremes  —  from  complete  fault 
tolerance  (with  no  incorrect  results  visible  externally)  to  a  total 
collapse  of  the  system.  In  the  latter  extreme  there  may  be  extensive 
loss  of  correctness  and  availability  for  a  protracted  time  during  and 
after  the  collapse,  and  lengthy  delays  until  correct  performance  and 
adequate capacity are again available. Between these extremes are
various forms of partial collapse, with varying degrees of
recoverability. The early detection of faults is also important to
prevent potential security violations that may result from faulty
behavior. Upon detection, diagnostic procedures can be used to assess
the  scope  of  the  error  propagation,  and  appropriate  recovery  procedures 
initiated. 

There are two ways of using the concepts of availability and correctness
to design a system. First, for many applications one design goal is to
eliminate down-time entirely or at least to reduce it to a negligible
amount. "Negligible" might mean seconds in the case of a telephone
utility, or minutes in the case of a time-shared facility, but in any
event the intent is to keep the machine running despite faults, pending
maintenance. For such applications it is usually sufficient to provide
single-fault tolerance for such critical functions as file handlers,
memory managers, and restart and recovery procedures, plus sufficient
redundant hardware so that a working system can be configured after the
occurrence of each fault. Second, for applications where the computer
is so remote as to preclude maintenance, the important issues are
(a) the probability that the computer has sufficient resources left
after a period of time, and (b) the probability that correct answers are
produced for certain critical functions. The aerospace environment is
perhaps the main current example of this approach, although some
transportation systems, electric power systems, financial systems and
secure systems have also been beneficiaries of fault-tolerance
techniques. In any event it might be necessary for a computer in such
an  environment  to  tolerate  many  faults. 

REDUNDANCY. An important measure of the effectiveness of any
fault-tolerant system is its REDUNDANCY. Let "k" be the cost of
hardware needed in the absence of any fault-tolerant requirement, and
let "r" be the cost of extra hardware needed to achieve fault-tolerant
behavior. Then the relative redundancy "R" is R = r/(k+r) = r/n, as a
fraction of the total cost "n" of the system. (This measure is used
more  or  less  exclusively  throughout  this  report,  rather  than  the 
alternative  approach  of  citing  the  percentage  increase  over  a  comparable 
intolerant  machine,  e.g.,  200  percent  for  triplication.)  This 
definition  is  consistent  with  the  coding  theory  concept  of  redundancy, 
in  which  k,  r,  and  n  are  measured  in  bits.  Except  in  relatively  trivial 
system  configurations,  it  is  a  difficult  chore  to  estimate  the 
redundancy.  For  a  system  that  just  employs  triplication  of  certain 
hardware  blocks  together  with  appropriate  voters,  the  redundancy  is 
(2+v)/(3+v), where "v" is the cost of the hardware voters relative to
the  functional  block.  The  evaluation  of  redundancy  is  more  difficult  in 
a  situation  where,  for  example,  a  multiprocessor  is  used  solely  to 
achieve fault tolerance. That is, if fault tolerance were not a
requirement,  then  a  conventional  uniprocessor  might  suffice.  Here  "r" 
must  include  numerous  items,  for  example, 

*  Storage  area  to  hold  reconfiguration  programs 

*  Extra  processing  power  to  overcome  the  multiprocessor  penalty 

*  Redundant  busses 

*  Cache memories to overcome bus traffic delays

*  Switches  to  accomplish  reconfigurations 

*  Storage  areas  to  hold  rollback  status. 

Similar  measures  are  meaningful  for  software  costs  and  execution  time. 
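
The redundancy measure defined above can be illustrated with a short
calculation. The following is a minimal sketch, not part of the original
report; the function names are hypothetical, and costs are expressed in
arbitrary units with the unprotected functional block costing 1.

```python
def relative_redundancy(k: float, r: float) -> float:
    """R = r / (k + r): extra cost as a fraction of the total cost n = k + r."""
    return r / (k + r)


def percent_increase(k: float, r: float) -> float:
    """Alternative measure: extra cost as a percentage of the intolerant machine."""
    return 100.0 * r / k


def triplication_redundancy(v: float) -> float:
    """Triplication with voters: k = 1 block, r = 2 extra blocks plus voters of cost v."""
    return relative_redundancy(1.0, 2.0 + v)


# Pure triplication: R = 2/3, which the alternative measure reports as 200 percent.
r_triple = relative_redundancy(1.0, 2.0)
pct_triple = percent_increase(1.0, 2.0)

# Duplication: R = 1/2.
r_dup = relative_redundancy(1.0, 1.0)

# Triplication with voters costing one tenth of a block: (2 + v) / (3 + v).
r_voted = triplication_redundancy(0.1)
```

Note that R always lies between 0 and 1, whereas the percentage-increase
measure is unbounded.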

The  priorities  among  the  system  goals  may  have  major  effects  on  the 
resulting  systems,  and  may  call  for  widely  differing  architectures.  In 
our consideration of various architectures in this report, we attempt to
evaluate  the  redundancy  required,  at  least  with  sufficient  accuracy  that 
a  gross  estimate  of  system  cost  is  possible.  Included  are  both 
low-redundancy  design  approaches  for  single-fault  tolerance  and 
higher-redundancy  approaches  for  multiple-fault  tolerance. 


2.2.  SEVERAL  ILLUSTRATIONS  OF  FAULTY  SYSTEM  BEHAVIOR 

Several  recent  examples  of  failures  in  contemporary  systems  are 
instructive.  The  first  provides  a  perspective  overview,  and  concerns 
Multics (Saltzer A2), a system with little hardware redundancy but with
file  availability  attained  through  software.  Here  outages  fall  into 
three  roughly  equal  categories:  hardware,  software,  and  operations.  To 
make  matters  more  complicated,  system  development  typically  has  gone  on 
concurrently  with  the  operation  of  the  production  system,  either 
simultaneously  on  the  same  (two-processor)  system  or  separately  with  the 
two  processors  partitioned.  The  hardware  problems  are  fairly 
traditional  (e.g.,  processor  problems,  memory  errors,  etc.),  although 
the Multics software is tolerant of many input-output and secondary
storage  errors  in  terms  of  providing  continued  availability.  The 
software  problems  are  due  mostly  to  new  bugs  introduced  by  the 
concurrence of the development effort, with new system versions being
installed as often as once a week. (This is in contrast to OS/360, in
which it appears that even the occasional new "debugged" release had
some large number of bugs -- the constant "1000" is popularly cited.)
Similarly  in  operations,  at  least  half  of  the  problems  are  the  direct 
result  of  the  development  process,  arising  through  manual 
reconfiguration  (due  to  a  hardware  design  not  intended  for  dynamic 
self-reconfiguration),  or  through  changes  in  operating  procedure.  The 
remaining  operational  problems  are  typical,  e.g.,  power  outages.  Thus, 
about  half  of  the  problems  are  attributable  to  the  coexistence  of  the 
development  effort.  The  pattern  of  roughly  equal  distribution  of 
failures  due  to  hardware,  software,  and  operations  is  found  in  many 
systems.  The  frequency  of  failures  seems  to  diminish  greatly  if 
experimentation slows down and production is stressed.

The  second  example  involves  the  outage  of  a  No.  1  ESS  (Ulrich  A2)  office 
in Nashville (at night), involving total outage of a few hours, with
partial  outage  for  ten  hours.  This  was  preceded  by  accumulated  errors 
in  the  call  store  combined  with  inadequate  responses  of  the  operating 
and  maintenance  staff,  eventually  triggered  by  malfunctions  in  both 
halves  of  the  system. 

The third example concerns the Market Data System MDS1 of the New York
Stock Exchange, operating with dual systems. After satisfactory system
validation prior to the opening for business on Feb. 24, 1972, system A
experienced  a  crash  four  minutes  into  the  market  session.  Automatic 
recovery  was  successfully  invoked  within  a  few  seconds  by  switching  to 
system  B,  and  correct  operation  continued  with  no  loss.  After  off-line 
maintenance  of  system  A,  the  contents  of  drum  B  were  copied  onto  drum  A, 
and  both  drums  were  again  on-line.  Unfortunately,  during  the  time  since 
the  morning  validation,  drum  B  had  developed  a  faulty  master  record, 
which  was  subsequently  accessed.  This  caused  system  B  to  halt.  Control 
was  automatically  switched  to  system  A,  whereupon  system  A  also  halted 
on  the  copy  of  that  record.  Manual  recovery  was  lengthy,  and  the  total 
outage  lasted  29  minutes,  the  worst  in  seven  years  of  operation.  (The 
New  York  Stock  Exchange  has  since  cut  over  to  MDS2,  with  three  360/50s 
and  a  duplicated  large  core  storage  for  files.) 


The  fourth  example  is  that  of  a  telephone  system  in  Kuala  Lumpur  which 
collapsed  twice  in  two  years,  with  significant  hardware  damage. 
Subsequent  analysis  finally  determined  that  each  of  these  events 
occurred  just  a  few  minutes  before  post  time  on  the  day  of  the  annual 
horse  race,  damaging  part  of  an  exchange  serving  a  community  noted  for 
its  gambling  spirit.  Unfortunately,  the  operations  personnel  were  all 
at  the  track  at  the  time,  and  could  not  notice  the  sudden  overload  in 
attempted  calls. 

The  first  example  illustrates  problems  that  can  be  overcome  by 
administrative  control  and  by  further  isolating  a  development  effort 
from  production.  It  also  illustrates  the  enormous  difficulty  of 
discriminating  among  hardware-induced  and  software-induced  errors. 
Multics  and  CP-67  provide  environments  in  which  noncritical  development 
of  software  can  be  debugged  on-line  within  a  production  system. 
Nevertheless,  final  debugging  of  critical  system  software  is  not  easy 
without  a  separate  system,  including  real  users  in  a  real  environment. 
This  problem  is  often  very  difficult,  as  in  the  case  of  the  development 
for  the  Interface  Message  Processors  (IMPs)  in  the  ARPA  Network. 

The  second  example  illustrates  the  typical  overdependence  on  the  need 
for  good  field  engineers.  In  some  cases  high  quality  maintenance  is 
possible  (e.g.,  in  the  FAA  air  traffic  control  system,  where  the  number 
of  centers  is  small).  In  the  ESS  case,  where  many  systems  are  involved, 
the  problem  becomes  critical.  If  a  skilled  engineer  is  required  at  all 
times  at  each  installation,  the  system  is  poorly  engineered.  If  he  is 
required  only  rarely,  but  then  urgently,  it  is  difficult  to  staff  all 
centers  with  sufficiently  skilled  and  motivated  personnel.  The  need  for 
systems  not  requiring  emergency  maintenance  is  thus  very  great, 
especially  when  many  systems  exist  at  distributed  locations,  each  with 
strict  availability  requirements. 

The third example illustrates the fact that software is usually never
debugged and never finished, as demonstrated by an unanticipated
situation which had never arisen in seven years' operation. There are
many other tales of systems in which long-standing hardware and/or
software bugs (in this case the lack of validation during copying) were
discovered only after years of operation. In some of these cases,
considerable  reexamination  of  the  correctness  of  earlier  results  was 
required.  Nevertheless,  the  MDS1  system  was  quite  remarkable  in  that  it 
used  off-the-shelf  equipment  and  recorded  a  highly  successful  record  of 
availability  in  its  lifetime. 
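
The lesson about validation during copying can be made concrete. The
following is a minimal sketch, not a description of MDS1: it assumes that
each record carries a CRC-32 check value, and the record layout and the
function names are hypothetical. The standby copy is overwritten only if
every source record passes its check, so that a faulty master record is
not propagated to both halves of a dual system.

```python
import zlib


def make_record(payload: bytes) -> tuple:
    """Return (payload, check value) for a hypothetical drum record."""
    return payload, zlib.crc32(payload)


def validated_copy(source: list, target: list) -> list:
    """Copy source onto target only if every source record passes its check.

    Returns the indices of faulty source records; target is left untouched
    when that list is non-empty.
    """
    faulty = [i for i, (payload, check) in enumerate(source)
              if zlib.crc32(payload) != check]
    if not faulty:
        target[:] = list(source)
    return faulty


# Drum B develops a faulty master record after the morning validation.
drum_b = [make_record(b"master record"), make_record(b"quotations")]
drum_b[0] = (b"mastex record", drum_b[0][1])    # payload no longer matches check

drum_a = [make_record(b"earlier state")]
faulty_records = validated_copy(drum_b, drum_a)  # the copy is refused
```

The design choice here is to refuse the whole copy rather than copy the
good records only; either policy keeps the fault from reaching the
standby unit unnoticed.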


The  fourth  example  illustrates  the  danger  in  taking  advantage  of  an 
apparently reasonable design assumption. In this case it was clearly
unwise to assume that traffic consisted of essentially independent
random calls.

Chapter  3  which  follows  discusses  techniques  for  fault  tolerance,  the 
effects  of  faulty  behavior  and  the  recovery  from  it. 
CHAPTER 3. TECHNIQUES FOR FAULT TOLERANCE


This  chapter  reviews  basic  principles  and  techniques  in  the  present  art 
of  design  for  fault  tolerance,  and  demonstrates  their  use  in  realizing 
economical  system  architectures.  Section  3.1  reviews  existing  and 
proposed design techniques for fault tolerance (applicable in hardware
and  in  software).  These  include  techniques  for  error  detection,  error 
confinement,  fault  location,  reconfiguration,  and  recovery. 

Section 3.2 examines the developing art of applying structure to system
design  and  implementation,  including  the  role  of  explicit  structural 
levels  in  partitioning  the  hardware,  the  software,  and  the  microware. 
Concepts of criticality are discussed. The use of time-space tradeoffs
useful  in  facilitating  economical  fault  tolerance  is  investigated.  Also 
considered  is  the  role  of  system  structure  in  achieving  rapid  recovery 
from  faults  not  completely  tolerated. 

Section 3.3 examines the application of these techniques to the
realization  of  economical  systems  and  networks.  Various  architectural 
types  are  considered.  Their  relevance  to  specific  applications  is 
discussed  in  Chapter  6. 

Detailed techniques for memory, and for arithmetic and logic, are
discussed  in  Chapters  4  and  5,  respectively. 

We assume here the use of intrinsically reliable technologies and of
sound  engineering  practice  (e.g.,  good  component  engineering  and  careful 
quality  control).  We  recommend,  but  do  not  discuss  in  detail,  the  use 
of  techniques  for  system  modeling,  reliability  analysis,  and  the  formal 
verification  of  design  properties.  We  assume  the  existence  of  good 
system  development  practice  (including  the  use  of  suitable  development 
tools,  e.g.,  languages,  debuggers,  and  test  environments)  and  good 
operating  practice  (e.g.,  avoidance  of  simultaneous  system  development 
except  under  highly  controlled  circumstances).  These  techniques  are 
particularly  important  in  the  attainment  of  good  fault  tolerance. 


We also recommend, but do not consider in detail, techniques for the
achievement  of  a  stable  physical  operating  environment.  These  include 
the  use  of  highly  distributed  reliable  power  supplies  to  minimize 
outage. For continuous availability, the use of standby batteries and
generators  is  desirable. 


3.1 DESIGN TECHNIQUES FOR FAULT-TOLERANT SYSTEMS


In this section, we classify and evaluate techniques for designing
fault-tolerant computer systems. The basic techniques are summarized in
Table 3.1. These techniques include novel techniques discussed in
detail  here  and  well-known  techniques  which  are  given  for  completeness. 

The  following  operations  are  basic  to  the  attainment  of  fault  tolerance. 

ERROR  DETECTION.  An  error  is  detected  when  a  discrepancy  signal  is 
received  by  some  subsystem  that  can  take  action  to  circumvent  the  error. 


ERROR  CONFINEMENT.  Errors  should  be  confined  as  much  as  possible  within 
particular  interfaces  until  some  correction  mechanisms  can  be  invoked. 

FAULT  LOCATION.  A  fault  (faults)  must  be  pinpointed  to  some  unit. 

RECONFIGURATION.  A  faulty  unit  must  be  removed,  replaced,  or  worked 
around.

RECOVERY.  In  the  case  of  error  propagation,  it  may  be  necessary  to 
restart  some  processes  at  some  error-free  state  in  order  to  perform  lost 
computation  and  restore  lost  files.  It  may  also  be  necessary  to  restore 
the  system  itself  to  a  viable  state. 

Most  of  the  techniques  discussed  here  are  fairly  well  known  and  well 
understood.  We  give  special  attention  to  some  of  the  cases  where  more 
research  is  required.  It  is  clear  that  a  system  can  be  designed  to 
tolerate faults occurring independently. The challenge is to achieve a
design that is not too costly in terms of hardware, and that can
tolerate realistic faults -- including certain dependent faults.
However, the design should also be modifiable as reliability and
availability needs change.

SELECTIVE  AND  DYNAMIC  USAGE  OF  FAULT-TOLERANCE  TECHNIQUES.  The 
techniques  of  Table  3.1  may  be  used  in  different  ways  with  respect  to 
space  and  time.  In  space  (e.g.,  within  a  memory  or  a  processor),  a 
technique  may  be  used  UNIFORMLY  (one  approach  throughout)  or  SELECTIVELY 
(applied  only  in  certain  places).  In  time,  it  may  be  used  STATICALLY  or 
DYNAMICALLY.  STATIC  usage  concerns  actions  with  no  changes  over  time  in 
the  operating  environment  or  in  the  flow  of  control  (e.g.,  fault-masking 
via  coding,  fixed  replication  with  voting).  DYNAMIC  usage  concerns 
fundamental  variations  in  the  control  (e.g.,  in  the  sequencing  or  in  the 
configuration),  such  as  in  detection  followed  by  diagnosis,  rollback, 
and  replacement,  or  as  in  the  use  of  replication  configured  only  on 
demand  of  the  software.  As  seen  below,  there  are  significant  advantages 
(e.g.,  cost  savings)  that  result  from  selective  and  dynamic  usage. 
Examples of these modes of usage are given in Section 3.2.

Numerous  specific  aspects  of  each  of  these  five  operations  are  discussed 
below.  For  the  present  discussion,  however,  we  wish  to  emphasize  the 
following  points: 

*  In  relatively  trivial  fault-tolerant  systems,  not  all  of  the  five 
operations  are  distinctly  identifiable.  In  a  system  that  employs  just 
triplication  with  voting,  for  example,  the  error  confinement  process 
embodies  the  other  four  operations. 

*  These  operations  can  be  carried  out  by  varying  combinations  of 
hardware,  software,  and  microcode.  Fault  tolerance,  therefore,  is  a 
distributed function which may be implemented at various computational
levels.

Table 3.1.

SUMMARY OF MAJOR DESIGN TECHNIQUES FOR FAULT TOLERANCE

DETECTION

Coding: error detection
Double-rail encoding for logic
Duplication in space or time and comparison (hard or soft)
Consistency checks (e.g., algorithmic checks, read after write, back
   substitution, partial Floyd assertions)
Probabilistic detection
Deferred detection
Detection as a byproduct of diagnosis, periodic or otherwise

PREVENTION OF ERROR PROPAGATION, AND LOCAL CORRECTION

Delaying the results until validated
Coding: error correction (Hamming, burst, byte)
Replication with voting
Isolation, e.g., via powering off, reconfiguration and fail-safe
   switching, fail-safe structural design (esp. involving protection
   and interrupts), use of read-only memories, asynchronous decoupling,
   clock independence

LOCATION OF FAULTS OR ERRORS

Coding: error location (also implicit in error correction)
Triplication (implicitly error locating)
Diagnosis, possibly with reconfiguration for testing

RECONFIGURATION AROUND FAULTY UNITS

Removal (deconfiguration) with degradation
Reconfiguration around a fault contextually (without its elimination)
Replacement by switching of standby spares
Replacement physically

RECOVERY

Single-instruction retry with buffered operands
Rollback to a program checkpoint, with manual or automatic
   checkpointing
Audit trails to facilitate subsequent recovery
Interpretive recovery (e.g., unwinding, salvaging, selective file
   retrieval)
Bootstrap recovery from fixed point (with side effects)

*  In effectively using the fault-tolerance techniques selectively and
dynamically,  there  are  fairly  well-defined  trade-offs  among  the  time 
required  to  carry  out  one  of  these  operations,  the  hardware  redundancy 
required,  and  the  probability  of  successfully  carrying  out  the 
operation.  For  example,  deferred  detection  and/or  correction  of 
arbitrary  logic  may  produce  significant  cost  savings. 

3.1.1.  ERROR  DETECTION 

One  of  the  main  problems  in  achieving  low-cost  fault  tolerance  is  the 
problem  of  achieving  economical  error  detection.  Aside  from 
well-structured  situations  such  as  core  memories,  parallel  adders,  tape 
memories  and  bus  transfers,  error  detection  with  less  than  50% 
redundancy  (duplication)  has  remained  unsolved. 

When a system fails, its failure is often obvious to a human. A
terminal may appear dead (e.g., because of a system crash or a loop in
the user's program), or the results may appear to be wrong. Internally, many
harmful  errors  are  similarly  vast,  e.g.,  involving  alteration  in  the 
flow  of  control.  The  reasons  for  wide  discrepancies  between  expected 
results  and  actual  results  include  the  following: 

*  A  faulty  logic  circuit  is  sometimes  used  repeatedly  in  the  absence  of 
internal  error  detection,  thus  increasing  the  chance  of  a  readily 
discernible  error. 

*  Many simple hardware faults (permanent or transient) have a drastic
effect on program control, e.g., directing control to an incorrect
instruction  or  addressing  the  wrong  memory  location.  Other  faults  may 
not  affect  control.  Also,  many  computations  do  not  allow  simple 
consistency  checks.  Thus  more  general  and  problem-independent  error 
detection  mechanisms  are  essential. 

In  spite  of  the  ease  of  some  error  detection  to  a  human,  error  detection 
can  be  a  costly  operation  when  carried  out  automatically.  An 
arbitrarily  structured  processor  using  known  error  detection  techniques 
seems to require at least 50% redundancy to permit immediate detection
of all possible processor errors. Fortunately, this cost does not carry
over to total system cost for error detection, for various reasons. For
example, the processor may be a small part of the total system; much
more palatable detection schemes exist for memories, channels, etc.
Similarly, there is no need to detect all possible errors, and also
immediate error detection is not needed. The basic methods are outlined
below, leaving to later sections the detailed discussions and
evaluations.

It  is  clear  that  errors  can  be  detected  with  any  desired  degree  of 
completeness, and with any desired degree of immediacy. The challenge
is  to  achieve  such  detection  with  low  redundancy. 

3.1.1.1 ERROR-DETECTING CODES


Error-detecting  codes  make  it  possible  to  detect  the  first  occurrence  of 
an error at some particular interface, e.g., memory, channels, an
arithmetic logic unit, a processor, or the entire computer. Section
3.1.1.3 below discusses the possibilities of deferred error detection.

Codes with a single parity bit for each memory word are widely used for
error detection in memory. Such codes, with negligible redundancy, are
useful for detecting single core errors or sense amplifier failures, or
any channel failure that results in a single bit being in error. This
concept extends to the detection of a fault within an arithmetic unit.
For example, if a result is additively incorrect by a power of two, a
residue code is useful, e.g., where the redundant digits represent the
residue of the word modulo 3.
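
Both checks can be sketched in a few lines. This is a minimal
illustration with hypothetical function names, not the circuitry of any
particular machine. The residue check works because no power of two is
divisible by 3, so adding or subtracting a power of two always changes
the residue modulo 3.

```python
def parity_bit(word: int) -> int:
    """Even-parity check bit: 1 if the word contains an odd number of 1s."""
    return bin(word).count("1") & 1


def parity_ok(word: int, stored_parity: int) -> bool:
    """True if the word read back is consistent with its stored parity bit."""
    return parity_bit(word) == stored_parity


def residue3(value: int) -> int:
    """Check digits for a modulo 3 residue code."""
    return value % 3


# A single-bit memory error always changes the parity.
word = 0b10110010
stored = parity_bit(word)
single_bit_errors_detected = all(
    not parity_ok(word ^ (1 << bit), stored) for bit in range(8)
)

# An arithmetic result that is off by plus or minus a power of two
# always changes the residue modulo 3.
value = 1000
check = residue3(value)
additive_errors_detected = all(
    residue3(value + (1 << k)) != check and residue3(value - (1 << k)) != check
    for k in range(16)
)
```

A double-bit error defeats the single parity bit, which is why the text
limits the claim to failures that leave a single bit in error.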

This  approach  also  generalizes  to  the  case  where  a  fault  produces  an 
error  in  a  single  b-bit  byte  of  memory  or  a  channel,  or  in  a  b-bit  byte 
of  an  arithmetic  processor.  In  memory,  b  bits  of  redundancy  suffice  to 
detect  errors  in  a  single  byte  for  words  of  arbitrary  length.  In 
arithmetic, similar redundancy is required (see Section 5.1). The
arithmetic codes may also be used in memory. The byte error situation
is a natural consequence of a byte-sliced memory or arithmetic unit,
wherein  each  byte  is  realized  as  a  single  LSI  chip. 
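
One simple way to obtain byte-error detection in memory with b check
bits is to store the bitwise modulo two sum of all the b-bit bytes of
the word. The following is a minimal sketch of that particular
construction, which is an assumption of this example (the report does
not specify the code used); the function names are hypothetical.

```python
def split_bytes(word: int, n_bytes: int, b: int) -> list:
    """Split a word into n_bytes fields of b bits each, least significant first."""
    mask = (1 << b) - 1
    return [(word >> (i * b)) & mask for i in range(n_bytes)]


def byte_check(word: int, n_bytes: int, b: int) -> int:
    """b check bits: the bitwise XOR of all b-bit bytes of the word."""
    check = 0
    for byte in split_bytes(word, n_bytes, b):
        check ^= byte
    return check


# Eight 4-bit bytes (a 32-bit word) protected by 4 check bits.
B, N_BYTES = 4, 8
word = 0x1234ABCD
stored_check = byte_check(word, N_BYTES, B)

# Any nonzero error pattern confined to one byte changes the check bits.
all_single_byte_errors_detected = all(
    byte_check(word ^ (pattern << (position * B)), N_BYTES, B) != stored_check
    for position in range(N_BYTES)
    for pattern in range(1, 1 << B)
)
```

The number of check bits depends only on the byte width b, not on the
word length, in agreement with the statement above.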

Unfortunately,  codes  for  detecting  errors  in  a  single  bit  or  byte  are 
not  effectively  extendable  to  arbitrary  logic.  (An  exception  is  a 
function  realized  exclusively  with  linear  logic.)  One  can,  in 
principle,  design  a  logic  unit  such  that  at  every  interface  the  vector 
of  signals  is  an  error-detecting  code  word  in  the  absence  of  a  fault,  or 
not  a  code  word  in  the  presence  of  a  fault.  Given  n  independently 
realized  output  functions,  the  simplest  way  to  do  this  is  to  provide  a 
circuit  with  an  output  which  is  functionally  the  modulo  two  sum  (parity 
check)  of  the  n  outputs  (Lofgren  5C).  Thus  faults  producing  an  odd 
number  of  errors  are  detectable.  However,  this  simple  approach  is  not 
practical  for  the  following  reasons. 

*  For most practical functions, the semi-empirical results of Pierce
(65)  indicate  that  the  cost  of  realizing  the  redundant  function  output 
may  approach  the  combined  cost  of  realizing  the  functions  themselves. 
Thus,  on  a  component  count  basis,  this  approach  may  be  as  bad  as 
duplication. 

*  For  nonindependent  realizations  of  the  n  outputs,  a  single  gate  fault 
is  likely  to  corrupt  more  than  one  output  (e.g.,  an  even  number  of 
outputs),  especially  in  an  LSI  environment.  Recent  work  by  Ko  (73) 
suggests  possible  circuit  augmentations  that  ensure  the  corruption  of 
only  an  odd  number  of  outputs.  This  work  indicates  that  some  functions 
exist  for  which  the  total  component  count  is  less  than  that  of 
duplication, but these functions tend to be exceptional. Besides, the
relative  saving  seems  to  be  insignificant  in  practice. 

*  Favorable results seem to rely upon the model of a single faulty gate
with just one stuck-at fault, which is, of course, not a reasonable
assumption for MSI/LSI circuits.

*  Finally, this approach is intended for multiple-output functions with
the  same  set  of  inputs.  For  a  single  output,  it  reduces  to  duplication. 
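
The effect of shared logic on the parity-check approach can be seen in a
toy circuit. The following is a minimal sketch; the two output
functions and the stuck-at-0 fault model are hypothetical choices made
only for illustration. The outputs are f1 = a AND b and
f2 = (a AND b) XOR c, so the independently realized parity function is
f1 XOR f2 = c.

```python
def outputs_independent(a: int, b: int, c: int, fault: bool = False) -> tuple:
    """Each output has its own AND gate; the fault hits only the gate of f1."""
    g1 = 0 if fault else (a & b)   # gate private to f1, possibly stuck at 0
    g2 = a & b                     # separate gate private to f2
    return g1, g2 ^ c


def outputs_shared(a: int, b: int, c: int, fault: bool = False) -> tuple:
    """One AND gate feeds both outputs; a fault in it corrupts both."""
    g = 0 if fault else (a & b)    # shared gate, possibly stuck at 0
    return g, g ^ c


def predicted_parity(a: int, b: int, c: int) -> int:
    """Separately realized modulo two sum of the two outputs (reduces to c)."""
    return c


def error_signalled(outputs: tuple, parity: int) -> bool:
    """True if the outputs disagree with the predicted parity."""
    f1, f2 = outputs
    return (f1 ^ f2) != parity


a, b, c = 1, 1, 0
parity = predicted_parity(a, b, c)

# Independent realization: one output is wrong (odd number), so it is detected.
detected_independent = error_signalled(outputs_independent(a, b, c, fault=True), parity)

# Shared realization: both outputs are wrong (even number), so it is missed.
detected_shared = error_signalled(outputs_shared(a, b, c, fault=True), parity)
```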
In general the nonapplicability of coding techniques for logic tends to
reinforce the early pessimistic results of Elias (58) derived for a
[illegible] logic unit, showing the necessity of duplication for error
detection in a single AND gate. The only exceptions to the apparent 50%
redundancy are specialized functions. For example, a single gate in
error in a tree-realized memory decoder results in either the selection
of no word at all, or the selection of multiple words. If it is assumed
that the accessed words are ANDed together (or ORed together) in
corresponding bit positions, then a comparatively economical code can be
used for error detection. Each n-bit word is encoded so that half of
the bit positions contain a "1" and half contain a "0", e.g., the "n/2
out of n" codes. The redundancy is quite low (e.g., about 10% for
32-bit words, less for longer words). The encoding and decoding cost is
small relative to the total memory cost, although it is higher than that
for single error correction (Anderson and Metze 73). This code can
detect arbitrarily many multiple errors if they are all of the same type
(e.g., either all 0 to 1, or all 1 to 0).
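
The redundancy figure and the detection property of the "n/2 out of n"
codes can be checked numerically. The following is a minimal sketch
with hypothetical function names; it counts the whole number of
information bits that the balanced code words of length n can carry.

```python
from math import comb, floor, log2


def balanced_code_redundancy(n: int) -> float:
    """Fraction of an n-bit word lost to the n/2-out-of-n constraint."""
    info_bits = floor(log2(comb(n, n // 2)))   # whole information bits available
    return (n - info_bits) / n


def is_code_word(word: int, n: int) -> bool:
    """A word is valid only if exactly n/2 of its n bits are 1."""
    return bin(word & ((1 << n) - 1)).count("1") == n // 2


redundancy_32 = balanced_code_redundancy(32)   # 3 of 32 bits
redundancy_64 = balanced_code_redundancy(64)   # 4 of 64 bits

# Decoder fault selecting two words at once: the AND or the OR of two
# distinct code words is never balanced.
w1, w2 = 0b00001111, 0b00110011
multiple_selection_detected = (not is_code_word(w1 & w2, 8)
                               and not is_code_word(w1 | w2, 8))

# Errors all of one type (here 1 to 0) are detected, however many occur.
unidirectional_detected = not is_code_word(w1 & ~0b00000101, 8)

# A mixed pair (one 1 to 0 and one 0 to 1) restores the balance and is missed.
mixed_pair_missed = is_code_word(w1 ^ 0b00010001, 8)
```

For 32-bit words the redundancy comes to 3/32, or a little under 10%,
and for 64-bit words to 4/64, consistent with the figures quoted above.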

One other coding scheme has been suggested for possible use in the PRIME
system (Borgerson A2), to detect address decoder failures or memory
bit-line failures. If a single error occurs in the address decoder,
then a word will be accessed whose address is additively incorrect by
some power of two. By noting the similarity with the effect of
arithmetic errors, it is clear that this type of address decoder fault
can be handled by appending to each memory word the modulo 3 residue of
the address. Thus, this scheme detects any single error in memory or in
the address decoder.
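
The address-residue idea can be sketched as follows. This is a minimal
illustration under stated assumptions, not the PRIME design: the class
and parameter names are hypothetical, the decoder fault is modelled as
a single inverted address bit on a read, and the check on the data
bits themselves is omitted.

```python
class TaggedMemory:
    """Memory in which each word carries the modulo 3 residue of its own address."""

    def __init__(self, size: int):
        # Every cell holds (data, address tag); size should be a power of two.
        self.cells = [(0, addr % 3) for addr in range(size)]

    def write(self, addr: int, data: int) -> None:
        self.cells[addr] = (data, addr % 3)

    def read(self, addr: int, decoder_fault_bit=None) -> int:
        """Read a word; optionally simulate a decoder fault on one address bit."""
        actual = addr if decoder_fault_bit is None else addr ^ (1 << decoder_fault_bit)
        data, tag = self.cells[actual]
        if tag != addr % 3:
            raise RuntimeError("address residue mismatch: wrong word accessed")
        return data


memory = TaggedMemory(16)
memory.write(5, 42)
good_read = memory.read(5)

# Each single-bit decoder fault moves the access by a power of two,
# which always changes the residue modulo 3.
faults_detected = 0
for bit in range(4):
    try:
        memory.read(5, decoder_fault_bit=bit)
    except RuntimeError:
        faults_detected += 1
```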

3.1.1.2. DUPLICATION


The  essence  of  duplication  is  simple  and  straightforward.  Results  are 
independently  computed  twice,  and  the  results  compared.  If  the  results 
are  binary-valued,  a  disagreement  indicates  that  one  of  the  computations 
is  in  error.  (If  the  results  are  multiple-valued,  both  may  be  in 
error.)  The  identification  of  the  erroneous  computation  is  deferred  to 
a more elaborate diagnosis. (A coding purist would contend that
duplication really involves a trivial error-detecting code; however,
since there are interesting engineering details concerning the
application of duplication at various system levels, it is worthwhile to
discuss this apart from any coding theory implications.)

Although  duplication  is  in  principle  applicable  to  any  subsystem,  it  has 
primary application where less costly techniques are inadequate. That
is,  duplication  is  used  for  error  detection  where  better  techniques  do 
not  work.  Generally,  duplication  may  be  used  in  conjunction  with 
arbitrary  logic  in  processors,  I/O  control  units,  special  control 
circuits, and some memory functions. There is clearly little need to
use  duplication  in  conjunction  with  storage  in  main  memory,  except 
possibly  in  certain  critical  applications. 

Duplication  may  be  employed  in  SPACE  (using  two  identical  units)  or  in 
TIME.  In  TIME  DUPLICATION,  only  one  unit  is  used  to  perform  the  same 
computation  twice  (but  perhaps  internally  reconfigured  or  shifted) 
before  the  computation  is  accepted  as  error  free.  Time  duplication  is 
less  credible  in  that  it  depends  on  the  equipment  being  exercised  in 
different  modes  in  order  that  the  two  computations  do  not  agree  because 
of  identical  or  compensating  errors.  Variations  and  combinations  of 
space and time duplication are also known. For example, two supposedly
complementary versions of a result may be generated. For processors
with an iterative structure, output data may be computed twice, but with
permuted  assignment  over  the  identical  modules. 
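
The weakness of naive time duplication, and the benefit of exercising
the unit in a different mode on the second pass, can be shown with a
toy model. The following is a minimal sketch under stated assumptions:
the adder, its single stuck-at-0 output bit, and the function names are
all hypothetical, and the second pass uses operands shifted left by one
position.

```python
def faulty_add(a: int, b: int, stuck_bit=None) -> int:
    """Adder whose output bit 'stuck_bit' (if given) is permanently 0."""
    total = a + b
    if stuck_bit is not None:
        total &= ~(1 << stuck_bit)
    return total


def repeat_agrees(a: int, b: int, stuck_bit=None) -> bool:
    """Plain time duplication: the same computation is simply run twice."""
    return faulty_add(a, b, stuck_bit) == faulty_add(a, b, stuck_bit)


def shifted_agrees(a: int, b: int, stuck_bit=None) -> bool:
    """Second pass uses shifted operands, so the fault hits a different digit."""
    first = faulty_add(a, b, stuck_bit)
    second = faulty_add(a << 1, b << 1, stuck_bit) >> 1
    return first == second


# 5 + 3 = 8 sets output bit 3, which is stuck at 0 in this model.
plain_result = repeat_agrees(5, 3, stuck_bit=3)     # both passes give the same wrong sum
shifted_result = shifted_agrees(5, 3, stuck_bit=3)  # the two passes disagree
```

In the plain case the two passes suffer identical errors and agree; in
the shifted case the disagreement signals that a fault is present,
although not which pass was wrong.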

The most obvious and common practice of duplication is to make a
comparison  on  every  machine  cycle.  On  the  other  hand,  if  the  comparison 
can  be  deferred,  there  may  be  an  advantage  to  performing  it  in  software. 
However,  a  software  implementation  requires  the  careful  isolation  of 
uncompared  results  to  prevent  error  propagation.  A  software 
implementation  also  requires  separate  working  memories  for  the  pair  of 
processors  to  hold  intermediate  (i.e.,  uncompared)  results.  Most  fault 
tolerant  systems  employ  hardware  duplication  to  avoid  the  error 
propagation  problem,  but  it  is  our  opinion  that  careful  attention  to 
recovery issues can lead to a feasible software implementation, at a
substantial  saving  in  hardware  cost. 

Another  important  engineering  detail  is  the  LEVEL  OF  PARTITIONING.  The 
issue  here  is  the  identification  of  the  interfaces  at  which  comparisons 
are to be made. For example, the comparison interfaces can feasibly be
at the system level (e.g., comparing the results of subroutines or
procedure  calls  on  exit),  at  the  processor  level  (e.g.,  comparing  two 
processors  nominally  executing  identical  instructions),  and  at  the 
subprocessor  level  (e.g.,  comparing  the  outputs  of  byte  slices  of  an 
arithmetic  unit). 

In  most  fault-tolerant  systems,  the  error  detection  interfaces  define 
the  units  to  be  removed  in  combatting  faults.  That  is,  if  the 
arithmetic  unit  is  a  replaceable  unit,  then  there  usually  exists  some 
mechanism  for  detecting  possible  errors  in  the  signals  emerging  from 
that unit. In any event, systems proposed for high-reliability,
long-life  applications  typically  employ  a  partitioning  for  error 
detection  at  a  low  system  level.  On  the  other  hand,  for  most 
applications of concern here, detection at the processor level or memory
unit  level  probably  suffices.  The  roles  of  various  levels  in  a 
fault-tolerant  system  are  discussed  in  more  detail  in  Section  3.2. 

3.1.1.3. DEFERRED DETECTION

It  is  often  not  essential  to  detect  a  fault  or  an  error  as  soon  as  it 
occurs.  If  the  detection  of  a  fault  or  error  can  be  DEFERRED,  it  is 
possible  to  reduce  the  redundancy  requirement  for  detection.  Deferred 
detection may be performed COMPLETELY (deterministically) subsequent to
the  occurrence  of  a  fault,  e.g.,  on  exit  from  a  computational  block.  It 
may  also  be  performed  PROBABILISTICALLY,  if  over  some  period  of  time, 
there  is  a  probability  p  that  the  error  is  detectable  (e.g.,  in  terms  of 
a  syndrome  or  other  discrepancy).  Three  application  areas  of  deferred 
detection are relevant.

NON UNIFORM DETECTION. In many fault tolerant systems, error detection
facilities are applied uniformly to all processes. In many cases, however,
errors can be allowed to occur without serious consequences, i.e., the
errors are non-critical. We see a possibility for some economies in
fault-tolerant equipment by applying error detection on the basis of
criticality. 

INCIDENTAL  DETECTION.  In  some  cases  it  is  simply  hoped  that  sooner  or 
later  (hopefully  sooner)  errors  will  be  detectable  without  the  use  of 
much  extra  redundancy.  This  may  be  acceptable  in  low-cost  units,  or  in 
cases  in  which  the  input  state  sequence  is  highly  predictable. 

UNFLEXED  DETECTION.  An  output  which  changes  only  rarely  from  its 
nominal  state  needs  special  detection.  A  pertinent  example  here  is  a 
fault  in  the  decoder  for  an  error-detecting  code  that  results  in  a 
constant  "no-error"  condition  being  emitted,  even  in  the  presence  of 
errors. Similarly, certain system functions that are executed
extremely rarely also require special detection. A latent fault in such
a rarely used function could remain undetected and eventually result in
a system failure. This problem is called the UNFLEXED-FUNCTION
DETECTION problem. In general, faults in such functions need not be
detected  as  soon  as  they  occur,  e.g.,  because  another  hardware  fault 
must  occur  before  this  function  is  required.  Hence  the  detection  of 
such  faults  can  be  deferred,  i.e.,  carried  out  probabilistically. 
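
The probabilistic form of deferred detection can be quantified with a small sketch. It assumes (this assumption is ours, not the report's) that each checking period independently exposes the error with the same probability p, so that the probability of detection within n periods is 1 - (1 - p)^n. The function names are hypothetical.

```python
import math


def detection_probability(p: float, periods: int) -> float:
    """Probability that an error is detected within `periods` independent periods."""
    return 1.0 - (1.0 - p) ** periods


def periods_needed(p: float, target: float) -> int:
    """Smallest number of periods whose cumulative detection probability reaches `target`."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))
```

Under this assumption the number of periods needed grows only logarithmically as the tolerated miss probability shrinks, which is why deferral can reduce the redundancy required for detection.
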

An  elegant  theory  has  been  developed  to  handle  the  third  area  (e.g., 
Carter  et  al.  72a,  Anderson  and  Metze  72).  For  example,  a  conventional 
error-detection  circuit  might  emit  a  "0"  if  there  is  no  error,  and  a  "1" 
if  there  is  an  error.  Obviously,  any  fault  that  leads  to  a  permanent 
emission  of  "0"  will  remain  undetected.  In  order  to  alleviate  this 
difficulty,  two  or  more  output  lines  are  provided  for  the  decoder.  In 
the  case  of  two  output  lines,  a  "0"  might  correspond  to  00  or  11,  and  a 
"1"  to  01  or  10.  The  decoder  is  designed  such  that  when  there  is  no 
error,  the  decoder  on  its  two  output  lines  emits  00  and  11  with  equal 
probability.  The  decoder  is  designed  such  that  any  single  stuck-at 
fault within it causes the output 01 or 10 for at least some code word
being presented at the input. This latter property has led to this type of
structure being called SELF-TESTING. The important positive conclusions
from  this  work  are  the  following. 

*  There  exist  fault-detection  techniques  that  require  less  than 
duplication  in  implementation,  although  for  incomplete  detection.  For 
the  case  of  the  decoder  for  a  single-error  correcting  (Hamming)  code, 
the  redundancy  is  about  25%  (Carter  et  al.  70a). 

*  There  is  an  alternative  to  periodic  diagnosis  in  detecting  faults  in 
unflexed  circuits. 

The  negative  conclusions  are  these: 

*  The  redundancy  requirements  are  low  only  for  well-structured 
functions,  e.g.,  decoders  for  error-correcting  codes.  For  other 
unflexed circuits, duplication may be as good.

* The unflexed circuits represent a low proportion of total system cost.
Thus, the incremental cost of using replication may be negligible.

* The fault model for the circuits is still concerned with single
stuck-at  faults.  For  more  realistic  faults,  it  is  likely  that 
duplication  is  close  to  optimal. 

*  Self-testing  circuits  appear  to  be  a  good  solution  for  certain 
functions  associated  with  the  unflexed  function  problem.  However,  it  is 
generally  not  clear  that  all  such  unflexed  functions  are  attractive 
candidates  for  self-testing  logic,  when  compared  with  periodic  diagnosis 
and  brute-force  replication. 
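
The two-output-line convention described above can be illustrated with a toy checker. This is a sketch under stated assumptions, not the decoder of the cited work: the code is a 4-bit even-parity code rather than a Hamming code, and the fault model covers only stuck-at faults on the two output lines rather than every internal line. All names are hypothetical.

```python
from itertools import product


def parity(bits):
    """Return 0 for an even number of ones, 1 for an odd number."""
    return sum(bits) % 2


def checker(word, stuck=None):
    """Two-output parity checker for 4-bit words.

    Each output line carries the parity of one half of the word. For an
    even-parity code word the two lines agree (00 or 11); a single-bit
    error makes them disagree (01 or 10). `stuck`, if given, is a pair
    (line, value) that forces one output line to a constant, modeling a
    stuck-at fault on that output line only.
    """
    half = len(word) // 2
    out = [parity(word[:half]), parity(word[half:])]
    if stuck is not None:
        line, value = stuck
        out[line] = value
    return tuple(out)


def signals_error(out):
    """The pairs 01 and 10 mean "error"; 00 and 11 mean "no error"."""
    return out[0] != out[1]


# All 4-bit words of even parity form the code.
CODE_WORDS = [w for w in product((0, 1), repeat=4) if parity(w) == 0]


def exposing_words(line, value):
    """Code words for which the given output stuck-at fault yields 01 or 10."""
    return [w for w in CODE_WORDS if signals_error(checker(w, stuck=(line, value)))]
```

In this toy case half of the code words produce 00 and half produce 11, so each output line is exercised in both states during normal operation, and every stuck-at fault on an output line is exposed by some code word without a separate diagnostic pass.
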

3.1.1.4. ERROR DETECTION VIA DIAGNOSIS

An  approach  to  error  detection  that  is  potentially  quite  efficient 
involves  periodic  diagnosis  of  the  fault-prone  system  blocks.  A 
CHECKING SEQUENCE is imposed on the inputs of the blocks in question
such that if any fault is present an output value will eventually emerge
that differs from the expected value. In order for the diagnosis
approach to be effective, as compared with say duplication, the
following  features  must  be  given  consideration: 

FAULT  COVERAGE.  Clearly  the  checking  sequence  must  be  capable  of 
revealing  an  extremely  large  fraction  of  the  likely  fault  patterns. 

Most  research  in  fault  diagnosis  has  been  concerned  with  networks  in 
which faults are manifested as a single gate being stuck-at-zero (SA0)
or stuck-at-one (SA1). In an LSI implementation the single stuck-at
assumption is not valid. An imperfection in an LSI chip tends to
propagate outward from some source point. Thus it is likely that gates
within a region will be suspect. It is likely that a checking sequence
that handles all SA0 and SA1 faults will handle a large class of other
fault patterns, although there is little formal work to substantiate
this conjecture. With regard to non-formal work in this area, computer
manufacturers  have  developed  checking  sequences  to  help  detect  failures 
within  their  CPUs.  Typically,  these  sequences  are  generated  by  ad  hoc 
techniques  and  reveal  only  about  90  percent  of  the  likely  fault 
patterns.  The  conclusion  here  is  that  at  present  the  fault  coverage  is 
not adequate for the error detection function. However, we feel that if
research effort is devoted to realistic fault models, and is coupled
with simulation techniques, this situation could be alleviated.

PERIODICITY  OF  CHECKING.  The  checking  sequence  must  be  applied  often 
enough  so  that  the  probability  of  two  faults  occurring  during  the 
intervening period is low. Also, since the faulty equipment might be
unavailable during the inter-diagnosis period, this period must be
shorter than the maximum tolerable unavailable time. For all but the
critical real-time applications, neither of these constraints is
limiting.  It  is  unlikely  that  a  diagnosis  of  any  system  block  needs  to 
be  carried  out  with  a  period  shorter  than  10-100  seconds. 
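
The first constraint can be made concrete with a small sketch. It assumes (our assumption, not the report's) that faults arrive as a Poisson process with a constant rate, in which case the probability of two or more faults in one checking period is 1 - e^(-x)(1 + x), where x is the expected number of faults per period. The function name and the rate used in the usage note are hypothetical.

```python
import math


def prob_two_or_more_faults(rate_per_hour: float, period_seconds: float) -> float:
    """Probability of at least two faults in one checking period (Poisson assumption)."""
    expected = rate_per_hour * period_seconds / 3600.0
    # 1 - e^(-x) - x e^(-x), written with expm1 to limit cancellation for small x.
    return -math.expm1(-expected) - expected * math.exp(-expected)
```

For small x the result is close to x squared over two, so shortening the period by a factor of ten lowers the double-fault probability by roughly a factor of one hundred. With a hypothetical rate of one fault per thousand hours and a 100-second period, the probability is far below any level of practical concern, consistent with the remark that the constraint is rarely limiting.
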

DIAGNOSIS OVERHEAD. The important overhead measures of diagnosis are
the amount of cpu effort devoted to diagnosis and the amount of high
speed memory needed to store the checking sequences. Concerning the cpu
overhead, the typical length of a checking sequence for arbitrary logic
is one-tenth the number of gates. (This is our experience for the
single stuck-at model, but for more realistic models the length should
not increase by more than a factor of two or three.) Thus a 10,000 gate
processor can be checked with a sequence of length 1,000. Assuming 2
microseconds per test, the total diagnosing time is 2 msec. The cpu
overhead is thus negligible for an inter-checking period of 10 seconds.
For this inter-checking period it is likely that the test itself can be
stored on disk, thus precluding the need for high speed storage.
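
The arithmetic behind this estimate can be written out as a short sketch. The per-test time is taken here as 2 microseconds, which is our inference from the stated 2 msec total for 1,000 tests; the function and parameter names are hypothetical.

```python
def diagnosis_overhead(gates: int,
                       gates_per_test: int = 10,
                       seconds_per_test: float = 2e-6,
                       period_seconds: float = 10.0,
                       model_factor: int = 1):
    """Return (sequence length, diagnosis time in seconds, fraction of cpu time).

    model_factor scales the sequence length to allow for a more realistic
    fault model (the text suggests a factor of at most two or three).
    """
    sequence_length = (gates // gates_per_test) * model_factor
    diagnosis_time = sequence_length * seconds_per_test
    return sequence_length, diagnosis_time, diagnosis_time / period_seconds
```

For a 10,000 gate processor this gives a sequence of 1,000 tests, about 2 msec of diagnosis, and a cpu overhead of about 0.02 percent for a 10 second period. Even with a model factor of three the overhead stays below 0.1 percent.
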

ERROR  CONFINEMENT  DURING  INTER  CHECKING  PERIOD.  All  computed  results 
are  suspect  until  the  processor  is  diagnosed.  Thus  it  is  necessary  to 
prevent possibly faulty results from propagating. In the PRIME system,
which utilizes diagnosis as a primary error detection mechanism, error
propagation is not a problem because of interprocessor isolation. In
other  systems  the  error  confinement  techniques  of  Sect.  3.1.2  must  be 
considered. 

FALLIBILITY  OF  DIAGNOSING  SYSTEM.  A  paramount  problem  in  diagnosis 
relates  to  the  problem  of  faulty  behavior  in  the  system  carrying  out  the 
diagnosis. In a system consisting of a single processor, the best
approach involves bootstrapping. Here a small system, assumed to be
infallible, carries out a diagnosis to verify the integrity of a larger
system. This larger system then acts to produce a still larger verified
system, and so on. The initial small system can be made error detecting
by duplication techniques. In a multiprocessor, the commonly conceived
approach is to have one processor diagnose another. If the diagnosing
processor reports an error, it is not decidable which processor is
faulty. (If no error is reported, it can be assumed that the diagnosed
processor is operative, provided no more than one processor is assumed
to be faulty.) If three or more processors are available, various
strategies can be invoked to resolve the ambiguity. Preparata et al.
(69) have presented one such strategy based upon a circular
configuration of diagnosing processors.
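
How a circular arrangement can resolve the ambiguity is sketched below for the single-fault case. This is an illustration in the spirit of the cited strategy, not a reproduction of it. It assumes that processor i tests processor i+1 around a ring of at least three processors, that at most one processor is faulty, that a fault-free tester always reports correctly, and that a faulty tester may report anything. The function names are hypothetical.

```python
from typing import List, Optional


def locate_faulty(syndrome: List[int]) -> Optional[int]:
    """Return the index of the faulty processor, or None if none is reported.

    syndrome[i] == 1 means processor i reports processor (i + 1) % n as faulty.
    """
    n = len(syndrome)
    if n < 3:
        raise ValueError("at least three processors are needed")
    if not any(syndrome):
        return None
    for i in range(n):
        # A report of 1 preceded by a report of 0 comes from a tester that
        # was itself passed by a fault-free predecessor, so it is trusted.
        if syndrome[i] == 1 and syndrome[i - 1] == 0:
            faulty = (i + 1) % n
            if all(syndrome[j] == 0 for j in range(n) if j not in (i, faulty)):
                return faulty
            break
    raise ValueError("syndrome is inconsistent with the single-fault assumption")


def simulate(n: int, faulty: int, lie: int) -> List[int]:
    """Build the syndrome for one faulty processor whose own report is `lie`."""
    return [lie if i == faulty else int((i + 1) % n == faulty) for i in range(n)]
```

The key observation is that the predecessor of the faulty processor is fault-free and therefore reports it, while the only unreliable report comes from the faulty processor itself, so the first 1 in the run of reports identifies the fault.
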

In conclusion, periodic diagnosis is potentially the most efficient
approach toward error detection. The only foreseeable limitation is the
inapplicability to transient faults. For permanent faults, additional
work is needed to improve the fault coverage obtainable with checking
sequences, particularly related to a fault model that is realistic in an
LSI  environment. 

3.1.2.  ERROR  CORRECTION 

The  state  of  the  art  of  coding  for  error  correction  and  efficient  (fast, 
cheap)  decoding  is  well  developed  (e.g.,  Peterson  and  Weldon  72, 
Berlekamp 68). Error-correcting codes exist for use in memory and in
arithmetic,  for  various  types  of  errors.  Such  types  include  correction 
of  single  errors,  independent  multiple  errors,  and  correlated  errors 
(e.g.,  arbitrary  errors  within  a  byte,  or  confined  to  a  burst  of 
consecutive  digits).  Memory  and  arithmetic  are  covered  in  Sections  4.1 
and  5.1,  respectively.  For  error  correction  in  processors  and  arbitrary 
logic,  triplication  and  voting  is  the  traditional  technique.  Many  of 
the  comments  in  Section  3.1.1  for  error  detection  are  also  extendable  to 
error  correction. 
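
Triplication and voting can be sketched in a few lines. The two functions below are illustrative only and their names are hypothetical: vote returns the majority of three replicated results of any hashable type, and vote_bits applies the usual bit-by-bit majority function to three integer words, as a hardware voter would.

```python
from collections import Counter


def vote(a, b, c):
    """Return the value produced by at least two of three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value


def vote_bits(a: int, b: int, c: int) -> int:
    """Bitwise majority of three words: each output bit is 1 if two or more inputs are 1."""
    return (a & b) | (a & c) | (b & c)
```

The bitwise form masks faults that affect different bit positions in different replicas, whereas the word-level form reports a failure when all three results differ.
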

3.1.3.  RECONFIGURATION  AND  RECOVERY 

Table  3.1  includes  several  items  on  reconfiguration  and  recovery  which 
are  fairly  self-explanatory.  As  seen  in  Section  3.2,  dynamically 
alterable  strategies  are  needed,  including  instruction  retry  and 
recovery from a parity error in memory (depending on what word was in
error, and what it was being used for). Reconfiguration of memory is
discussed in detail in Section 4.2 and in Appendix 3. Recovery is
discussed in Section 3.2.5.

3.2. STRUCTURED DESIGNS FOR FAULT TOLERANCE


Most  computer  system  designs  seem  to  evolve  in  an  ad  hoc  fashion, 
reflecting both the structure of the organization(s) to which the
designers belong (Conway's Law) and the lack of a holistic design view.
Here we examine the role of structure in system design, and how it can
facilitate the effective use of the above techniques for fault
tolerance. (This section is inspired by Simon 62, Dijkstra 65, 68, 69,
and Neumann 69, 72, 73. Also relevant is the work of Horning and
Randell 73, and Parnas 72.) Well-conceived system structure can
contribute  significantly  to  the  design,  implementation,  debugging, 
verification,  testing,  diagnosis,  maintenance  and  operation  of 
fault-tolerant  systems.  As  employed  here,  such  structure  permits  a  wide 
range  of  techniques  to  be  applied  selectively  and/or  dynamically,  when 
and  where  they  are  most  effective  in  terms  of  cost  and  reliability.  Low 
cost  can  be  achieved  by  taking  advantage  of  nonuniform  constraints  and 
various  time-space  tradeoffs.  This  is  in  contrast  to  many  existing 
systems  which  employ  (statically)  primarily  low-level  techniques  for 
fault  tolerance.  A  well  chosen  system  compartmentalization  helps  limit 
error  propagation,  improves  autonomous  maintenance,  and  enables  the 
persistence  of  system  security  in  spite  of  faults;  it  also  facilitates 
long-term  evolutionary  growth  of  the  system,  responsive  to  new 
applications  needs,  new  hardware,  and  new  software. 

Hierarchical  aspects  of  such  structure  permit  a  hierarchical  recovery 
strategy  directly  reflecting  the  structure  of  the  design  and  the  needs 
for recovery. Such a strategy can be relatively efficient, in that it
can be dynamically tailored to the actual fault(s). Recovery varies
widely in complexity, depending on the nature of the faulty behavior.
It may be quite simple, as in the case of a detectable transient error
in arithmetic (with buffered instruction retry) or in a memory with
error-correcting coding, or it may be quite complicated, e.g., after a
total collapse of the system. In general, the recovery strategy should
assure recovery of the most critical parts of the system first.
Structured recovery strategies are found to some extent in the Plessey
System 250 (Williams A2) and in Multics (Saltzer A2).
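
One way to picture a hierarchical recovery strategy is as an ordered list of recovery levels, tried from the least to the most drastic until one succeeds. The sketch below is only an illustration of that idea; the function name and the level names in the usage note are hypothetical and do not describe the Plessey or Multics mechanisms.

```python
from typing import Callable, List, Tuple


def recover(levels: List[Tuple[str, Callable[[], bool]]]) -> str:
    """Try each recovery level in order; return the name of the first that succeeds.

    Each level is a (name, attempt) pair, where attempt() returns True if the
    system has been restored at that level.
    """
    for name, attempt in levels:
        if attempt():
            return name
    raise RuntimeError("no recovery level succeeded")
```

A caller might supply levels such as instruction retry, process restart, and full system restart, in that order, so that a transient arithmetic error is handled by the cheapest level and more drastic actions are invoked only when the simpler ones fail.
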


The  system  design  should  integrate  the  needs  for  fault  tolerance  and  for 
effective  recovery  with  the  other  system  needs  of  security,  efficiency, 
capability,  etc.  (the  PRINCIPLE  OF  GLOBAL  DESIGN).  Successful 
integration  is  greatly  facilitated  by  a  highly  structured  design  that 
deals  with  architectural  concepts  irrespective  of  whether  they  are 
implemented  in  hardware,  in  microprogram,  or  in  software,  and  which 
evolves  in  a  roughly  "top-down"  or  goal-driven  fashion.  Since  software 
capability  of  one  generation  is  frequently  found  in  the  hardware  of  the 
next  generation,  this  view  is  highly  appropriate. 

3.2.1.  STRUCTURAL  LEVELS  OF  INVISIBILITY 

The structure of a system can have considerable impact on the fault
tolerance of the system, as well as on the system development as a
whole. Although this subsection considers the role of such structure in
general,  it  provides  a  basis  for  fault  tolerance  throughout  this  report. 
Of  interest  in  this  subsection  are  the  interrelations  that  form  the 
structure  among  the  various  system  mechanisms.  At  the  interface  to  each 
system  mechanism,  various  implementation  details  may  be  hidden  from  the 
invocation  of  that  interface.  When  an  interface  to  a  mechanism  makes 
such  implementation  details  invisible,  that  mechanism  is  said  to  be  a 
virtual  mechanism  (see  below).  The  interface  provides  a  level  of 
invisibility  between  its  invocation  and  its  implementation. 

There  are  many  different  structural  views  of  the  mechanisms  within  a 
computing system, both system-oriented and user-oriented. The
techniques  for  fault  tolerance  may  be  applied  at  various  levels  with 
respect  to  any  of  several  such  views.  Consider  first  several 
system-oriented  views.  With  respect  to  hardware  dependence,  levels  of 
structure  vary  from  components  to  subunits  to  functional  units  to  a  raw 
machine  to  a  microprogrammed  machine  through  various  levels  of  software 
support  to  a  network  of  systems.  Corresponding  levels  of  language 
capability  (above  the  circuitry  levels)  range  from  microprogram 
instructions  to  machine  language  instructions  to  macro-assembly  and 
compiler statements through various levels of block structure,
subroutine, and operating system calls, to system commands and network
commands. A command itself may be substructured, with various levels of
subrequests and requests within it. In units of time, levels of
responsiveness range over a wide spectrum of response requirements, with
different mechanisms requiring responses of picoseconds to nanoseconds
to microseconds to milliseconds (e.g., for peripherals) to seconds
(e.g., for human interaction), etc. Within a system, different sets of
levels exist with respect to processors, memories, input-output,
control, and intercommunication. In memories, for example, such levels
range from storage for a bit of information to storage for (encoded)
representations of words to blocks to memory modules to a hierarchy of
diverse types of memories, e.g., managed (in hardware and software) as a
single level of memory and organized into a directory structure (e.g.,
as directories of directories of files). In communication, levels range
from intraprocessor communication to interprocessor, intersystem, and
even  internetwork  communication.  Other  levels  that  are  more  or  less 
orthogonal  to  the  above  levels  are  also  distinguishable,  e.g.,  the 
levels  of  reliability  and  protection  discussed  in  Section  3.2.2. 

A  system  in  execution  is  controlled  in  hardware  and  in  software  by  its 
OPERATING  SYSTEM,  and  may  be  viewed  overall  as  a  collection  of 
PROCESSES. Each process is a single locus of sequential control,
relative to some address space. A process may invoke or create other
processes,  but  in  itself  may  not  have  multiple  simultaneous  "threads"  of 
execution.  Thus  a  process  is  the  basic  unit  of  asynchronous  processing. 
Each  process  may  be  thought  of  as  using  a  VIRTUAL  PROCESSOR,  i.e.,  a 
processor  exclusive  to  that  process.  The  address  space  of  each  process 
is  its  VIRTUAL  MEMORY,  with  just  that  information  (stored  in  a  portion 
of  actual  memory)  which  is  directly  accessible  to  the  process.  The 
virtual memory provides a (simplified) interface to the real memory, and
makes  the  management  of  actual  memory  largely  invisible.  The  operating 
system  may  be  thought  of  as  multiplexing  the  various  processes  onto  the 
actual  system,  and  multiplexing  the  corresponding  virtual  memories  onto 
the  apparent  single  level  of  memory.  At  this  level  the  mechanisms  of 
MULTIPROGRAMMING  (i.e.,  the  concurrent  use  of  main  memory  by  several 
processes) are invisible. The operating system may itself be executing
in a MULTIPROCESSING mode, i.e., it is able to run on multiple
processors simultaneously. Some of the operating system processes may
be  allocated  dynamically  to  special-purpose  processors,  while  others  may 
be  permanently  dedicated  to  specific  hardware. 

Each  process  also  has  facilities  for  input-output.  Here  levels  range 
from  data  representations  on  devices  and  on  media  to  data  structures 
(e.g.,  bytes,  characters,  records,  files)  to  various  forms  of  VIRTUAL 
INPUT-OUTPUT  (with  invisibility  of  many  details  of  device  dependence,  of 
multiprocessing and of multiprogramming, e.g., via virtual devices with
invisible  formatting  and  symbolic  device  attachments).  The  system  is 
responsible  for  multiplexing  the  actual  input-output  devices  and  media. 

There  are  various  levels  of  process  structure,  from  protection  domains 
within  processes,  to  processes  within  a  system,  to  intrasystem  and 
intersystem  process  families.  From  the  view  of  a  single  "user"  (whether 
he is a casual turn-key user, a systems program developer, or an
environment  being  controlled  by  or  controlling  the  computer  system),  he 
may  see  a  single  process.  He  may  also  wish  to  distribute  a  job  among 
several  asynchronous  processes  within  a  FAMILY  OF  PROCESSES.  In  the 
presence  of  multiple  processors,  this  leads  to  multiprocessing  at  the 
user level. His process family makes many process mechanisms invisible.
Each user has his own view of the actual system, which may be thought of
as his VIRTUAL SYSTEM. (In some systems the process family view and the
virtual system view may be identical.) Apart from inter-user
communication and file sharing, a virtual system appears to each user as
his  own  private  system,  and  may  be  different  (in  part)  from  the  virtual 
system  of  other  users.  A  user  may  wish  to  invoke  several  virtual 
systems,  either  on  one  actual  system  or  on  several  systems  in  a  network. 
The  simultaneous  use  by  one  user  of  different  systems  within  a  network 
leads  to  the  concept  of  a  VIRTUAL  NETWORK,  in  which  many  details  of 
system  multiplexing  are  invisible. 

Another  user  view  arises  with  binding.  BINDING  refers  to  the  act  of 
reducing the indefiniteness of an incompletely specified entity (e.g., by
assigning it a resource). Levels of binding specificity typically range
from program specification to program generation to compilation to
object  code  generation  to  linking,  loading,  and  execution.  Linking  and 
loading  may  each  be  partially  static  (in  advance  of  execution)  and 
partially  dynamic  (being  invoked  during  execution  with  respect  to  other 
executing  programs).  At  each  successive  (lower)  level  of  binding,  more 
machine-dependent detail is added to a program or collection of
programs. This detail is normally not visible to the higher levels of
binding. 

A  conceptually  simple  but  highly  powerful  linear  structuring  of  system 
levels is discussed by Dijkstra (68, 69). Internal details of
implementation at a given level are normally made invisible to higher
levels. Functional capability at that level is dependent on the
capability of the next lower level, and is precisely that provided by
the lower-level interface languages. (That functional capability may in
fact represent a loss of power compared with the next lower level.) The
levels  are  referred  to  as  LEVELS  OF  INVISIBILITY.  Successively  higher 
levels  correspond  to  larger  units  of  time.  (In  the  sense  that  an 
interface  creates  a  higher-level  concept,  it  provides  a  LEVEL  OF 
ABSTRACTION.) 

More  generally,  a  VIRTUAL  mechanism  is  one  that  provides  a  layer  of 
invisibility  between  the  interface  to  that  mechanism  and  the  details 
internal to the implementation of the mechanism, independent of the
structure among the various mechanisms. It may in some cases also
reduce the power of that mechanism available at the given interface, but
can  in  no  way  increase  it.  (Note  that  even  a  gate  appears  as  a  virtual 
mechanism  to  a  logic  circuit  using  it.)  This  does  not  mean  that  all 
details  of  the  use  of  such  a  mechanism  are  invisible.  In  fact, 
efficiency  considerations  may  dictate  that  some  controls  on  the  use  of 
the  mechanism  must  be  accessible  at  the  virtual  interface  (although  not 
normally required). Similarly, it may sometimes be desirable (e.g., for
efficiency)  to  use  directly  a  mechanism  at  a  more  detailed  level,  rather 
than  passing  through  many  levels  of  interfaces.  In  some  sense,  most 
mechanisms can be viewed as virtual mechanisms. However, the PRINCIPLE
OF LEAST VISIBILITY dictates that implementation detail should be
visible only where necessary. It is desirable that this principle have
a  strong  influence  on  the  structure  of  the  system. 
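
The notion of a virtual mechanism can be illustrated with a toy virtual memory layered on a real memory. This is a sketch only; the class names, the page-table representation, and the sizes in the usage note are hypothetical. The virtual interface hides the mapping onto real frames, and it can only restrict what the real memory offers, never extend it.

```python
class RealMemory:
    """The lower-level mechanism: a flat array of cells grouped into frames."""

    def __init__(self, frames: int, frame_size: int):
        self.frame_size = frame_size
        self._cells = [0] * (frames * frame_size)

    def load(self, address: int) -> int:
        return self._cells[address]

    def store(self, address: int, value: int) -> None:
        self._cells[address] = value


class VirtualMemory:
    """The virtual mechanism: callers see only virtual addresses."""

    def __init__(self, real: RealMemory, page_table: dict):
        self._real = real
        self._page_table = dict(page_table)  # virtual page -> real frame

    def _translate(self, vaddr: int) -> int:
        page, offset = divmod(vaddr, self._real.frame_size)
        if page not in self._page_table:
            raise IndexError(f"virtual address {vaddr} is not mapped")
        return self._page_table[page] * self._real.frame_size + offset

    def read(self, vaddr: int) -> int:
        return self._real.load(self._translate(vaddr))

    def write(self, vaddr: int, value: int) -> None:
        self._real.store(self._translate(vaddr), value)
```

Two processes given VirtualMemory objects with disjoint page tables over the same RealMemory each see a private address space starting at zero, and neither can name a cell outside its own mapping. This compartmentalization is also what limits the propagation of errors from one process to another.
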

For  any  given  set  of  virtual  mechanisms,  there  is  an  interconnection 
structure  among  them  by  virtue  of  the  use  of  their  interfaces. 

Dijkstra's  linear  structuring  of  levels  is  not  always  realistic.  In  a 
complex system, the partial ordering among virtual mechanisms may be an
arbitrary directed graph, rather than a linear ordering. Nevertheless,
there  may  be  local  regions  in  which  it  is  linear  or  tree  structured.  In 
general,  it  is  highly  desirable  to  have  a  tree  structure  if  not  a  linear 
structure.  In  some  cases  it  may  also  be  desirable  to  lump  a  collection 
of  mechanisms  into  linear  levels  (e.g.,  for  descriptive  purposes  or  for 
implementation simplicity), even though these mechanisms are not
properly linear. However, the extremes of excessively simple structure
and excessively compartmentalized structure should both be avoided. It
is extremely helpful to keep these types of levels conceptually distinct
while  designing  a  system,  even  if  they  are  blurred  in  the  resulting 
implementation,  e.g.,  to  achieve  adequate  performance. 
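
The lumping of mechanisms into linear levels can be sketched as follows, assuming the "uses" relation among mechanisms is acyclic. Each mechanism is assigned a level one higher than the highest mechanism it uses. The function names and the mechanism names in the usage note are hypothetical; the sketch relies on the standard graphlib module (Python 3.9 or later), which raises CycleError if the relation contains a cycle.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set


def assign_levels(uses: Dict[str, Set[str]]) -> Dict[str, int]:
    """Map each mechanism to a level; a mechanism sits above everything it uses."""
    levels: Dict[str, int] = {}
    # static_order yields every mechanism after all of the mechanisms it uses.
    for node in TopologicalSorter(uses).static_order():
        used = uses.get(node, ())
        levels[node] = 1 + max((levels[dep] for dep in used), default=-1)
    return levels


def is_linear(uses: Dict[str, Set[str]]) -> bool:
    """True if no two mechanisms share a level, i.e. the structure is a single chain."""
    levels = assign_levels(uses)
    return len(set(levels.values())) == len(levels)
```

For example, if a file system uses both a virtual memory and a virtual input-output mechanism, those two share a level and the structure is not properly linear, although it can still be described level by level.
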

3.2.2.  LEVELS  OF  CRITICALITY 

Given  a  structure  among  mechanisms  dictated  by  the  principle  of  least 
visibility, additional constraints arise in terms of implicit or
explicit levels of criticality, e.g., sensitivity to fault-induced
errors. The lowest (or innermost) levels (of highest sensitivity) are
often referred to as the "hard-core" or the "kernel" of a system. It is
worth noting, however, that usage and definitions of such terms are far
from standard. Refer, for example, to Appendix 2. The term "hard-core"
is used in at least three nonequivalent but overlapping fault-tolerance
senses: (a) survival, (b) coverage, and (c) exposure. Consider
respective illustrations of these three senses: (a) "that which must
survive" (Wensley A2) or "that whose malfunction could crash the
system" (Ulrich A2, and implicitly Saltzer A2); (b) "that which is
covered by redundancy" (Avizienis A2); and (c) "that hardware which is
irredundant" (Hopkins A2), or "that hardware (redundant or not) whose
failure will produce undetected errors" (Carter A2). Note that (b) and
(c) are roughly complementary views. Also note the view in PRIME
(Borgerson A2) that there is NO hard-core (undefined), because the
supervisor can float from one processor to another. An earlier usage is
"that which must function correctly" (Forbes et al. 65).

A  functional  sense  of  criticality  is  also  found.  For  example,  the 
"hard-core"  paging  software  in  a  paged  environment  usually  contains  some 
programs (e.g., certain buffers and programs supporting paging itself)
which themselves cannot be paged out. There is also software whose
frequency  of  use  dictates  that  it  should  remain  in  main  memory  for 
efficiency  reasons. 

In addition, many levels of criticality with respect to system security
are  relevant  here,  including  the  integrity  of  the  system  itself  and  of 
resident  files.  The  kernel  for  security  may  be  thought  of  as  that  part 
of  the  system  whose  correct  functioning  is  most  critical  to  the 
uncompromised  security  of  the  system.  A  related  concept  is  that  of  a 
SECURITY  PERIMETER,  i.e.,  a  set  of  functions  (programs,  processes,  etc.) 
within  which  system  security  may  in  some  way  be  compromised,  either  by 
misuse  or  by  malfunction.  The  security  perimeter  in  the  absence  of 
faults seems to be significantly larger than is generally recognized.
In the presence of faults, it may be very large unless the system is
carefully partitioned. Guarantees of system security are desirable, at
least in a probabilistic sense, both in the absence of faults (but in
the presence of possible misuse) and in the presence of faults.
Unfortunately, the kernels for reliability, for availability, and for
security  are  not  conceptually  identical,  even  though  most  systems  tend 
to  lump  them  together. 

3.2.3.  SYSTEM  STRUCTURE  FOR  FAULT  TOLERANCE 

Structured  system  design  and  structured  implementation  are  developing 
arts  that  have  immediate  use  in  the  design  and  implementation  of  systems 
with  economical  fault  tolerance.  Although  further  work  is  needed  to 
make such structure an integral part of the design, rather than just
good practice, the benefits are already considerable. Recent efforts in
this direction are found in structuring software implementations, e.g.,
structured  programming  (e.g.,  Dijkstra  69). 

The  notions  of  invisibility  and  criticality  impose  various  constraints 
on  the  structure  among  system  mechanisms.  For  the  purpose  of  designing 
and  implementing  fault-tolerant  systems,  the  most  critical  mechanisms 
should be carefully identified, and separated from similar but less
critical mechanisms. However, the various views of "criticality" should
be  integrated.  As  noted  above,  the  interactions  among  correctness, 
availability,  and  security  are  particularly  strong.  There  are  also 
strong  interactions  with  critical  mechanisms  for  reconfiguration, 
recovery,  and  restart  following  detected  faults  (or  detected  security 
violations),  for  interrupt  handling  and  abnormal  condition  handling,  and 
for  on-line  interactive  maintenance.  These  critical  mechanisms  also  cut 
across  hardware-software  boundaries.  Table  3.2  provides  an  illustration 
of  such  critical  elements  affecting  fault  tolerance.  The  multiprocessor 
architecture  of  Section  3.3.3  illustrates  an  economical  system  using 
selective  and  dynamic  redundancy  for  these  elements. 

As  a  general  rule,  the  mechanisms  of  greatest  criticality  themselves 
should  be  well  structured  and  small  enough  to  verify  and  control.  This 
enhances  selective  and  dynamic  usage  of  various  fault-tolerance 
techniques,  when  and  where  they  are  most  effective,  whether  implemented 
in  hardware  or  in  software.  It  also  facilitates  controlling  system 
operation  and  recovery,  and  can  further  enhance  the  verification  of 
correctness  of  the  system  design  and  its  implementation,  especially  with 
respect  to  fault  tolerance  and  security.  In  this  way  it  is  also  possible 
to anticipate the effect of faults on system behavior (including
security) and to tailor the design and the recovery strategy to the
possible faults, their likelihoods, and their possible effects. Such
design is particularly important if security is to be maintained
despite faults.


Table 3.2

CRITICAL ELEMENTS FOR FAULT TOLERANCE


MEMORY MANAGEMENT

Memory maps, e.g., page tables, device maps, associative memory maps
Memory contents, including critical data, contents of some registers,
  input-output buffers, channel control words and interrupt cells
Memory allocation mechanisms
Memory bootstrap recovery and reconfiguration

PROCESSOR MANAGEMENT

Memory fetches and address formation, including page relocation, and
  generation and validation of protection information
Receipt and interpretation of interrupts

Critical microcode, including interpretation and protection
Process creation, dispatching, and deletion
Interprocessor communication
Some exception handling

Primitive reconfiguration control, configuration sensing and setting
Primitive accounting and measurement facilities

INPUT-OUTPUT MANAGEMENT

Channel control, especially of shared channels
Certain media contents
Some exception handling


Structure among virtual interfaces enters naturally into system design
as follows, as a result of the above considerations. It is desirable
that this design be driven more or less from the top down, although it
is usually necessary to iterate up and down in order to assure that the
design converges to a suitable system.

(a)  Various  system  partitions  are  established  explicitly  in  the  design, 
identifying  virtual  mechanisms  and  levels  of  criticality,  responsive  to 
the  various  overall  system  goals  of  correctness,  availability,  security, 
functional  capability,  capacity,  performance,  efficiency,  etc. 

Effective  modularization  involves  careful  control  of  communication  among 
mechanisms  to  help  limit  error  propagation.  Useful  mechanisms  are  known 
for  this  purpose,  both  to  avoid  conflicts  and  to  permit  sharing  of 
programs, data and control (e.g., Dijkstra 68, Spier and Organick 69,
Holt  72,  Baer  73).  Except  for  deadlock  avoidance,  such  mechanisms  are 
conceptually  clear-cut. 

(b)  Associated  with  these  partitions  are  subpartitions  for  selective 
use  of  the  techniques  of  fault  tolerance,  as  well  as  possible 
configurations  of  these  techniques  and  possible  modes  of  dynamic 
reconfiguration  of  these  techniques  within  and  among  the  partitions. 
Successive  levels  of  binding  noted  above  may  be  useful  points  at  which 
to bind fault-tolerance techniques as well (dynamically or statically).

(c) Analysis, simulation, verification, and operating experience should
be  used  to  study  the  relative  effectiveness  of  these  techniques  under 
varying  demands  and  of  reliable  algorithms  for  deciding  how  and  when  to 
switch  among  configurations.  The  suitability  of  the  choice  of 
partitions  should  also  be  evaluated.  The  exact  boundaries  among 
hardware,  microprogram,  and  software  should  be  established  as  late  in 
the  design  process  as  possible.  Mechanisms  with  high  duty  cycles  should 
very  likely  appear  in  hardware  or  microprogram. 

The applicability of relevant techniques for fault tolerance to various
virtual mechanisms is illustrated by Table 3.3. The first column of the
table identifies some illustrative interfaces. (Higher and less
machine-dependent levels are toward the bottom of the table.) The
second column gives examples of concepts invisible at (i.e., outside of)
each interface. The third column gives examples of the techniques of
Section 3.1. These techniques can enhance fault tolerance within each
interface, although the details of their implementation should be
largely invisible at the interface. Almost every technique is suitable
for selective use. Those techniques which also lend themselves to
dynamic reconfigurability are indicated by an asterisk in the table.
The dynamic control over reconfiguration of such techniques may be done
internally at each level, as well as (under controlled circumstances)
via the appropriate interface language. Reconfigurations within one
level are often independent of those at other levels. Techniques for
reconfiguration and recovery from faults are found within most
partitions.


Table 3.3

EXAMPLES OF TECHNIQUES FOR FAULT TOLERANCE
APPLICABLE TO A HIERARCHY OF INTERFACES


INTERFACES        INVISIBLE CONCEPTS            APPLICABLE FAULT-TOLERANCE TECHNIQUES
(examples)        (examples)                    (examples)


PHYSICAL VIEW (HARDWARE):

Components, chips Technology details,           Intrinsically reliable technologies, good
                    fabrication methods           engineering, quality control, coding and
                                                  fault-masking, replication.

Subunits          Board layouts, pin            Conservative design, reliable connectors,
                    connections, [illegible]      environmental control; *Diagnosis,
                                                  component replication, some coding,
                                                  double-rail logic, replacement.

Units:

Processors        Processor algorithms          *Automatic instruction retry;
                                                  *Replication, coding.
                  Degraded modes                *Logic via arithmetic, double-precision
                                                  half-units.
                  Address calculation,          Bounds checking, descriptor validation,
                    absolute addresses,           memory protection, coding and replication
                    associative mechanisms        in address generation, coding and
                                                  cross-checking in associative memory.
                  Bus control                   *Alternate routes, coding, degradable
                                                  priority mechanisms.
                  Interrupts                    *Race-free fail-operational interrupt design.
                  Microcode                     *Microdiagnostics, validation of microcode;
                                                  coding. *Automatic reloading.

Memory units      Cache mechanisms,             *Coding on memory contents; *Read after
                    internal representation       write at certain levels, hardware-checked
                                                  descriptors and type information.
                  Internal configurations       *Reconfiguration around bad memory (via
                                                  paging, de-interlace). Use of read-only
                                                  memories to avoid overwrite and aid
                                                  recovery.

Input-output      Device characteristics        *Coding on contents of media and
  devices                                         transmission.
                  Media properties,             *Verification, checking, read and compare
                    device dependence             after write.

System            Configurations                *Configuration sensing and
                                                  self-reconfiguration, powering on-off
                                                  (incl. spares), distributing and replacing
                                                  power supplies.


PROCESS VIEW (HARDWARE AND SOFTWARE):

Virtual processor Multiprocessing by system:    Coding, handshaking on interprocessor
                    binding of processes to       communication, avoidance of interprocessor
                    processors, processor         interference; *Replication of physical
                    dispatching                   processors as a single virtual processor,
                                                  voting as needed. *Configuration
                                                  insensitivity via checked table-driving.
                  Multiprogramming:             *Explicit measures of permitted degradation
                    multiplexing of processes     per process. Safeguards on interprocess
                    onto the system, process      communication (vs. lost interrupts, blocked
                    scheduling, process           polling), avoidance of interprocess
                    isolation                     interference, intraprocess protection
                                                  (rings, domains, master modes).
                  Array computing               *Reconfiguration and replacement within the
                                                  array.
                  Multiplexing of microcode     Isolation of system microcode from
                                                  user-alterable microcode.

Virtual memory    Multiplexing of virtual       *Replication of critical data in various
                    memories onto real memory,    places in hierarchy, including reliable
                    backup and retrieval,         cheap backup store; *Automatic rollback.
                    directories, device maps,     Redundant pointers in directory structure
                    protection mechanisms         and file maps to permit fast recovery;
                                                  Access control on files (e.g., write
                                                  protection); The use of pure procedure to
                                                  inhibit loss of critical data or programs
                                                  and to aid in automatic rollback.
                                                  Redundancy in interprocess and file
                                                  protection mechanisms.

Virtual           Multiplexing of I-O,          Handshaking to avoid loss of information;
  input-output      virtual devices               *Status information. *Device
                                                  switchability, media replication.
                  Exception handling            *Coding (e.g., redundant headers);
                                                  *Flexible error handling.
                  Asynchrony, buffering,        Race-condition and deadlock avoidance.
                    channel management            I-O device, media, and data protection
                                                  mechanisms.


USER VIEW (SOFTWARE):

Process family    Algorithmic parallelism       *Replication of virtual processors for a
  (job)           Allocation of processes         single process. *Independent computational
                  Multiprocessing by user         checks (via possibly distinct processes)
                                                  within a process family; *Automatic
                                                  rollback.

Virtual system    Multiplexing of virtual       Inter-user protection (from the system and
                    systems, sharing of data      each other). *Controlled sharing (if any);
                                                  Self-identifying descriptors.
                  System correctness            *Validation, evaluation of effectiveness and
                                                  correctness.
                  Command interpretation        *On-line maintenance; Good compilers,
                                                  diagnostics, debuggers.

Virtual network   Multiplexing of computer      *Coding on intersystem communication,
                    systems and their             alternate paths. Intersystem protection
                    intercommunication            mechanisms. *Detailed status of network
                                                  control and network requests. *Human
                                                  intervention (as a last resort) with good
                                                  judgment.


Asterisks denote techniques particularly amenable to dynamic use. Almost all
techniques are suitable for selective use.

The  choice  of  structure  among  and  within  virtual  mechanisms  may  depend 
on  the  particular  system  specification.  For  example,  simplifying 
assumptions  (e.g.,  no  multiprogramming  or  no  multiprocessing)  often 
permit  simplified  structure.  Further,  each  mechanism  of  Table  3.3  may 
be  scattered  among  hardware  and  software.  Contemporary  hardware 
typically  exhibits  a  superficial  modularity  at  the  functional  unit 
level,  although  usually  not  internally  to  the  extent  desired  here. 
Multics (Saltzer A2), Project SUE at Toronto (Sevcik et al. 72), and
Hydra for the C.mmp at Carnegie-Mellon (Siewiorek A2) are systems that
exhibit  good  structure  in  their  operating  systems. 

3.2.3.1. EXAMPLE OF A STRUCTURED FAULT-TOLERANT COMPUTER SYSTEM

As  a  simple  example,  consider  a  multiprocessing  system  of  five 
processors,  each  normally  allocated  at  any  moment  to  a  distinct  process. 
At the VIRTUAL SYSTEM interface, each user (or application environment)
deals with a command language interface to the system running under a
process or process family. Each virtual system may in turn employ one
or more (real) processes, either invisibly on behalf of the operating
system or visibly on behalf of the user to exploit intrinsic
parallelism, or to provide redundant (but possibly algorithmically
distinct) computations. At the PROCESS interface, each virtual
processor,  virtual  memory,  and  virtual  input-output  capability  may 
involve  fault  tolerance  techniques. 
Table 3.4

EXAMPLES OF VARIOUS MODES OF USAGE


Note: VP = virtual processor (possibly replicated), Mp = primary memory,
d = Hamming distance; S = single, D = double, EC = error correcting,
ED = error detecting.


At  the  VIRTUAL  PROCESSOR  interface,  a  single  process  may  be  executed  on 
just  one  processor  or  redundantly  on  different  processors,  with 
comparison or voting. However, there appears to be a single processor,
exclusive to each process. The configuration might for varying periods
of  time  include  two  of  the  five  processors  both  executing  replicas  of 
the  same  process  in  a  comparison  mode,  or  three  processors  in  a  voting 
mode,  or  even  in  rare  cases  five  in  a  voting  mode.  Internal  details  of 
such  mechanisms  should  be  mostly  invisible  to  each  process.  These  modes 
may  vary  selectively  (e.g.,  only  certain  processors  might  be  usable  in  a 
replicated  mode)  and  may  change  dynamically  (for  example  running  simplex 
except  when  certain  critical  operating  system  functions  are  invoked). 
Examples  at  this  interface  are  found  in  the  upper  half  of  each  box  in 
Table  3.4.  Examples  of  systems  possibly  able  to  provide  such 
flexibility include ARMMS (Martin A2), C.mmp (Siewiorek A2), and SIFT
(Wensley  A2). 
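
The mode switching described for the VIRTUAL PROCESSOR interface can be
sketched as follows. This is only an illustration under stated
assumptions: the names Mode, Disagreement, and run_replicated are
hypothetical, and step(processor) stands for one unit of a process
executed on one physical processor. It is not taken from ARMMS, C.mmp,
or SIFT.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    SIMPLEX = 1      # one processor, no checking
    COMPARISON = 2   # two replicas, disagreement is detected
    VOTING_3 = 3     # three replicas, one faulty result is masked
    VOTING_5 = 5     # five replicas, two faulty results are masked


class Disagreement(Exception):
    """Raised when the replicas do not yield an acceptable result."""


def run_replicated(step, processors, mode):
    """Run `step` on mode.value processors and return a single result.

    `step` is a hypothetical callable; its results must be hashable.
    The caller sees one virtual processor regardless of the mode.
    """
    results = [step(p) for p in processors[:mode.value]]
    if len(results) < mode.value:
        raise ValueError("not enough processors for this mode")
    if mode is Mode.SIMPLEX:
        return results[0]
    if mode is Mode.COMPARISON:
        if results[0] != results[1]:
            raise Disagreement(results)
        return results[0]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise Disagreement(results)   # no strict majority
    return value
```

Because the mode is an argument, a caller could run simplex for ordinary
work and pass a voting mode only around critical operating system
functions, without the process itself being aware of the change.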

At  the  VIRTUAL  MEMORY  interface,  device  addresses  are  invisible.  There 
is  often  redundancy  in  the  implementation  of  a  virtual  memory  system, 
some  of  which  is  suitable  for  recovery  and  reliability.  In  systems  in 
which  memory  files  do  not  directly  become  a  part  of  a  user's  virtual 
memory  but  rather  copies  are  made  into  the  virtual  memory  (as  in  360/67 
TSS),  there  is  the  redundancy  of  the  duplicate.  In  systems  in  which 
files  (e.g.,  segments  in  Multics)  directly  become  a  part  of  a  user's 
virtual  memory  when  being  actively  used,  a  virtual  memory  page  may  be 
found  in  various  versions  and  in  various  modes  of  replication  on  various 
devices  in  the  memory  hierarchy.  For  example,  in  a  paged  environment, 
various  instances  of  a  given  page  may  exist  simultaneously  in  a 
cache-type  memory,  in  primary  memory  and  in  secondary  memory.  If  it  is 
part  of  a  procedure  that  is  "pure"  (unchanged  by  execution),  then  all 
instances  are  identical  (barring  errors);  if  it  is  data,  the  instances 
may differ if there is no write-through, or else may be identical. This
natural temporary proliferation can be used constructively to provide
checkpoints, thus greatly facilitating automatic rollback. It is
especially useful with various instances of critical data. The
redundancy may of course vary, depending on instantaneous needs. In any
event the recovery and rollback strategies must be carefully integrated
with the memory hardware and software.
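
A minimal sketch of how the copies of a page at different levels of the
memory hierarchy can serve as checkpoints is given below. It assumes a
toy two-level memory without write-through; the class PagedMemory and
its method names are hypothetical and do not describe Multics or TSS.

```python
class PagedMemory:
    """Toy two-level memory without write-through."""

    def __init__(self, pages):
        self.secondary = dict(pages)   # stable copies (the checkpoint)
        self.primary = {}              # working copies, possibly modified

    def read(self, page):
        # Prefer the working copy; fall back to the stable copy.
        return self.primary.get(page, self.secondary[page])

    def write(self, page, contents):
        # Only the working copy changes; the stable copy is untouched.
        self.primary[page] = contents

    def checkpoint(self):
        """Commit all working copies; they become the new checkpoint."""
        self.secondary.update(self.primary)
        self.primary.clear()

    def rollback(self):
        """Discard working copies; execution resumes from the checkpoint."""
        self.primary.clear()
```

The point of the sketch is that rollback costs nothing beyond discarding
the modified copies, since the older instance already exists lower in
the hierarchy.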


At the VIRTUAL INPUT-OUTPUT interface, many details of devices are
invisible (e.g., formats, recovery strategies, and asynchrony). Fault
tolerance can be increased by extensive use of table driving, providing
possibilities  for  the  use  of  coding  in  the  tables,  as  well  as  isolating 
the  handling  of  various  devices. 
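
As an illustration of coding applied to a driving table, the sketch
below attaches a check value to each device-table entry and verifies it
on every lookup. The table layout, the field names, and the helper names
seal and lookup are assumptions made for this example; CRC-32 is used
here only as a convenient error-detecting code.

```python
import zlib


def _check_value(entry):
    # CRC-32 over a canonical rendering of the entry's fields.
    return zlib.crc32(repr(sorted(entry.items())).encode())


def seal(entry):
    """Return (entry, check) for storage in the device table."""
    return entry, _check_value(entry)


def lookup(table, device):
    """Fetch a device entry, refusing to use it if its check fails."""
    entry, check = table[device]
    if _check_value(entry) != check:
        raise ValueError(f"device table entry for {device!r} is corrupted")
    return entry
```

A damaged entry then affects only the one device it describes, which is
the isolation property the text attributes to table driving.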

At  the  PROCESSOR  interface,  the  structure  and  the  implementation  of  each 
processor  are  largely  invisible.  There  may  be  several  levels  of 
invisibility  inside  this  interface.  As  seen  by  an  instruction,  for 
example,  automatic  instruction  retry  and  physical  memory  addresses  are 
typically invisible. Selective replication and replacement are suitable
for  logic  and  arithmetic,  with  coding  useful  for  arithmetic  in  some 
cases.  Especially  critical  within  this  level  is  address  generation, 
with  respect  to  both  security  and  reliability.  Coding  and  replication 
are  useful  in  assuring  that  addresses  are  correctly  generated. 

At the MEMORY interface, byte-slicing, coding and reconfiguration
(discussed in detail in Chapter 4) are examples of fault tolerance
techniques that are usually invisible to the effective address of an
instruction. All three can benefit from selective usage. For example,
different codes may be used in different portions or types of memory.
It may even be desirable to have some memory (e.g., for use by critical
operating system data) with greater redundancy. These techniques also
may benefit from dynamic usage. One such approach to coding entails
different uses of a particular encoding. For example, consider a code
with Hamming (or arithmetic) distance 4 for single-error correction and
double-error detection. When one error is known to be permanent, the
code may actually be used to correct a second error (Siewiorek and Ingle
73). When the multiple error rate is high, the code may better be used
for triple-error detection (accompanied by increasingly loud cries for
help). Another such example is the use of a byte-error correcting code
as a multiple-error detecting code when multiple-byte errors are
suspect. Still another dynamic approach is the use of varying encodings
depending on usage, even possibly by changing the word length. For
example, it might be advantageous to use different encodings for
different types of information, e.g., for data and for instructions to
be executed. This could be aided by a tagged architecture (e.g.,
Feustel 73, Miller A2). Note that there is dynamic (but hardware
supported) duplication of data in memory in the Intermetrics
Multiprocessor of Miller (A2). (Typical examples at this interface are
found in the lower half of each box in Table 3.4.)
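
The three uses of a single distance-4 code mentioned above can be
illustrated with a small example. The sketch assumes the extended
Hamming (8,4) code and decodes by brute-force search over its sixteen
codewords; the mode names and the function names are hypothetical, and a
real memory would use syndrome logic rather than a search.

```python
from itertools import product


def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = (p1, p2, d1, p3, d2, d3, d4)
    overall = sum(word) % 2          # extra parity bit: distance 3 becomes 4
    return word + (overall,)


CODEBOOK = {encode(d): d for d in product((0, 1), repeat=4)}


def distance(a, b, skip=()):
    """Hamming distance, ignoring any positions listed in `skip`."""
    return sum(x != y for i, (x, y) in enumerate(zip(a, b)) if i not in skip)


def decode(word, mode="sec-ded", known_bad=None):
    """Return the 4 data bits, or None if an uncorrectable error is detected.

    "sec-ded": correct one error, detect two.
    "ted":     correct nothing, detect up to three errors.
    "erasure": bit `known_bad` is a known permanent fault; correct it
               and one further error.
    """
    word = tuple(word)
    if mode == "ted":
        return CODEBOOK.get(word)
    skip = (known_bad,) if mode == "erasure" else ()
    near = [c for c in CODEBOOK if distance(c, word, skip) <= 1]
    return CODEBOOK[near[0]] if len(near) == 1 else None
```

The same stored word is interpreted differently in each mode; only the
decoding policy changes, which is what makes this form of redundancy
suitable for dynamic use.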

Similarly  at  the  MODULE  interface,  multiple  arithmetic  or  functional 
units  tied  to  a  control  unit  may  be  used  in  replication  for  fault 
tolerance,  or  in  synchronism  as  in  the  ILLIAC  IV  for  handling 
parallelism  in  computation,  or  independently.  The  first  of  these 
applications  substantially  increases  reliability,  while  the  others  may 
substantially  increase  the  computational  throughput.  Degraded  but 
continued  operation  may  be  achieved  with  multiple  or  byte-sliced  units, 
e.g.,  by  invoking  a  multiprecision  mode  among  reduced  precision  units. 
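
As a sketch of the multiprecision mode mentioned above, the following
assumes a 32-bit adder built from two 16-bit slices, one of which has
failed. The surviving slice is used twice, with the carry passed between
passes. The word sizes and the names add16 and add32_degraded are
assumptions chosen for the illustration.

```python
MASK16 = 0xFFFF


def add16(a, b, carry_in=0):
    """One reduced-precision unit: 16-bit add with carry in and carry out."""
    total = a + b + carry_in
    return total & MASK16, total >> 16


def add32_degraded(x, y):
    """32-bit add using a single surviving 16-bit unit, in two passes."""
    low, carry = add16(x & MASK16, y & MASK16)
    high, carry = add16(x >> 16, y >> 16, carry)
    return (high << 16) | low, carry
```

Operation continues with the full word length, at roughly half the
arithmetic throughput.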

Explicit  structures  of  virtual  mechanisms  are  now  evident  in  a  few 
recent  computing  systems,  both  in  hardware  and  in  software.  For 
example, the Multics protection mechanism (Schroeder and Saltzer 72)
provides  successive  linear  levels  of  resilience  to  errors  in  hardware, 
software,  and  humans  in  its  levels  of  protection.  A  spectrum  of 
criticality  exists  with  respect  to  faults.  Only  malfunctions  (hard  or 
soft)  at  the  lowest  software  level  affect  the  viability  of  the  system. 
Others  have  diminishingly  serious  effects  on  the  correctness  of 
operation  as  the  level  increases,  e.g.,  aborting  one  user’s  process, 
aborting  one  command,  or  aborting  just  one  request  within  a  command.  As 
with  hardware,  software  techniques  for  fault  tolerance  may  also  differ 
from level to level. An example is provided by the SIFT environment
(Wensley A2), in which a wide range of real-time criticality is found
among various tasks, and for which redundancy can be suitably configured
to the task.
3.2.4. IMPLICATIONS OF STRUCTURED DESIGN


In  this  section  we  discuss  the  numerous  benefits  of  the  structured 
design  approach.  These  include  enhancements  of  reliability  and 
computational  capacity,  and  reduction  of  cost. 

SELECTIVE AND DYNAMIC APPLICATION OF REDUNDANCY. A wide variety of
techniques  for  fault  tolerance  can  be  applied,  each  where  it  is  most 
effective  and  responsive  to  the  needs  for  fault  tolerance  and  computing 
capacity.  Each  configuration  of  fault-tolerance  techniques  can  be 
dynamically  altered,  on  the  basis  of  the  current  usage  of  the  system. 
(The  reconfiguration  may  affect  more  than  one  level  at  once.)  The  net 
cost  of  system  fault  tolerance  can  therefore  be  reduced,  especially  if 
rarely  used  fault-tolerance  techniques  can  be  performed  reliably  in 
software.  Considerable  savings  also  result  if  occasional  modest 
real-time  delays  are  permitted  (e.g.,  for  diagnosis,  recovery,  and 
reconfiguration),  further  reducing  the  need  for  dedicated  hardware.  The 
typically  nonuniform  distribution  of  costs  within  a  system  also  permits 
a  reduction  of  the  incremental  cost  of  fault  tolerance.  Memory  costs 
(including  secondary  storage)  seem  to  dominate  total  hardware  costs  in  a 
well  balanced  system,  even  in  emerging  technologies  (see  Chapter  6). 
Consequently,  the  relatively  small  cost  of  redundancy  in  memory  (e.g., 
varying logarithmically with word length for single-error or byte-error
correction throughout memory) may dominate the incremental cost, even
with replicated processors, but even more so with dynamic and selective
replication. Dynamic and selective use of coding (e.g., Table 3.4)
further  reduces  the  cost  of  fault  tolerance.  A  tagged  architecture  may 
be  of  significant  help  in  this  respect.  Structured  design  also 
facilitates  checkpoint  mechanisms  that  permit  varying  degrees  of 
rollback  at  different  levels,  as  needed.  On-site  maintenance  and 
diagnosis  are  also  aided. 
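
The logarithmic growth of memory redundancy with word length can be
checked with a short calculation. The sketch assumes a Hamming code, for
which single-error correction over k data bits needs the smallest r
satisfying 2**r >= k + r + 1, with one further bit for double-error
detection. The function name sec_check_bits is hypothetical.

```python
def sec_check_bits(k):
    """Smallest r with 2**r >= k + r + 1 (single-error correction)."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r


for k in (8, 16, 32, 64):
    r = sec_check_bits(k) + 1        # one extra bit for double-error detection
    print(f"{k:2d} data bits: {r} check bits, overhead {r / k:.1%}")
```

Under these assumptions the overhead falls from 5 check bits on 8 data
bits (62.5 percent) to 8 check bits on 64 data bits (12.5 percent).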

GRACEFUL  DEGRADATION.  In  general,  computing  capacity  not  currently 
dedicated  to  fault  tolerance  is  available  for  useful  computing,  assuming 
reasonable  system  balance.  It  is  desirable  to  configure  among  pools  of 
modules,  functional  units,  processors,  and  systems.  The  multiplicity  of 
each pool should be large enough so that graceful degradation is
possible (i.e., that the loss of any unit is not serious). This
increases the overall system effectiveness, in terms of both computing
capacity  and  fault  tolerance. 

SIMPLIFICATION  OF  THE  DESIGN  PROCESS.  Well-chosen  system  structure  can 
enhance each stage of system development (including designing,
implementing,  documenting,  debugging,  certifying,  analyzing, 
maintaining,  and  modifying  the  system).  At  each  such  stage  the  notion 
of  levels  of  invisibility  permits  issues  of  fault  tolerance  relevant  to 
lower levels to be abstracted and analyzed, aiding in isolating any
side-effects.  Thus  the  structure  serves  as  a  useful  model  as  well. 

ADAPTABILITY  TO  ADVANCED  TECHNOLOGY.  Recent  technological  advances 
(e.g.,  LSI)  significantly  improve  the  cost-effectiveness  of  many  of  the 
techniques  for  fault  tolerance.  These  advances  should  also  stimulate 
new  architectural  directions,  such  as  multiprocessors  with  considerable 
multiplicity,  and  distributed-logic  and  logic-in-memory  designs.  The 
latter  case  involves  large  arrays  of  small  memory  elements,  each 
containing  processing  capability.  These  arrays  could  be  organized  into 
subarrays  of  subarrays,  possibly  with  structures  geometrically  oriented 
toward the problem to be solved (cf. Kautz and Levitt 72).

APPLICABILITY TO FAULT-TOLERANT SYSTEMS. The structural approach seems
particularly effective for large general-purpose systems. It also seems
useful for many systems with some tight real-time constraints, for which
selective redundancy can result in significant cost savings, compared to
the  uniform  use  of  high-order  replication. 


Questions  of  overhead  and  reliability  must  be  examined  carefully.  It 
appears  that  the  overhead  due  to  the  use  of  structure  can  usually  be 
kept small, except when fault-tolerance limits are approached. It is
obviously desirable that the mechanisms for controlling reconfiguration
themselves be fault tolerant, thrash-resistant, secure, and
reconfigurable. Interference problems and intercommunication must also
be  handled  reliably. 
In highly structured systems, there is a basic problem [...]
overhead of multilevel interpretation (i.e., of [...]
many levels of language). This problem is alleviated [...]
certain low-level language constructs to be directly [...]
available from outer levels. Where explicit level [...]
(as in protection mechanisms), the interlevel communication [...]
should be simple. Judicious use of hardware for such mechanisms is
essential, as in the case of various associative shortcut mechanisms.

In some cases it is also advantageous to reduce the number of conceptual
levels in the implementation.


Various  questions  remain  unanswered  by  this  discussion.  Can  the 
tradeoffs  among  fault  tolerance,  computing  capacity,  cost,  overhead, 
etc.,  be  rigorously  characterized?  Under  what  circumstances  is  it 
desirable  to  reconfigure?  What  kind  of  limiting  behavior  occurs  as 
computing  capacity  or  fault-tolerance  capacity  is  reached?  What  are  the 
penalties  associated  with  having  too  little  or  too  much  structure?  What 
happens  to  the  notion  of  the  "weakest  link",  namely,  those  mechanisms  to 
whose  malfunction  the  system  is  most  vulnerable?  Can  this  notion  be 
distributed  among  less  weak  links?  How  does  it  shift  during 
reconfiguration? 


Our  assessment  of  the  structured  design  approach  is  that  it  has  the 
potential for providing highly flexible and economical fault tolerance
without greatly compromising system cost, system performance, and system
efficiency. Some qualities of structure are found in the current art,
but  full  realization  of  this  potential  requires  further  development. 

3.2.5.  STRUCTURED  RECOVERY  STRATEGIES  AND  MASSIVE-TRANSIENT  RECOVERY 

One  useful  approach  for  effective  recovery  over  wide  ranges  of  faulty 
behavior  follows  the  PRINCIPLE  OF  LEAST  EFFORT  (Zipf  49).  It  is 
desirable  to  structure  the  system  so  that  subsequent  to  a  fault,  the 
availability of the most essential services can be restored as rapidly
as possible, deferring (or overlapping with restored operating capacity)
that which need not be done immediately. In this way, it is possible
for  the  system  to  recover  by  successive  iteration,  outward  from  the  most 
critical  mechanisms.  (See  Carter  et  al.  71a  for  a  discussion  of  the 
recovery  problem  and  the  control  of  recovery.  See  also  Williams  A2, 
Saltzer  A2,  Stern  73,  and  Stern  and  Van  Vleck  73.) 

As  an  example  of  a  specific  problem  that  can  be  greatly  simplified  by 
the  adoption  of  a  hierarchical  structure  and  hierarchical  recovery 
strategies  reflecting  that  structure,  consider  the  "massive-transient" 
recovery  problem: 

A  correlated  fault  source  (e.g.,  a  power  surge  or  a  bolt  of  lightning) 
has  left  all  units  of  the  system  suspect,  perhaps  introducing  both 
transient  and  permanent  faults.  The  problem  is  for  the  system  to 
diagnose and configure itself back into a working configuration and to
validate itself for correctness, all under its own control. Note that
the  software  as  well  as  the  hardware  must  be  considered  suspect. 

This problem is essentially a generalized fault-tolerance problem, where
performance  may  cease  temporarily  during  and  just  after  the  massive 
transient.  It  is  also  closely  related  to  normal  system  initialization. 
Design  structure  and  dynamic  reconfigurability  both  aid  greatly  in 
solving  this  problem.  One  solution  involves  validating  a  correct 
configuration  of  hardware  and  bootstrapping  upward  from  the  lowest 
levels,  until  a  satisfactory  rudimentary  system  is  obtained.  This 
solution  is  aided  by  the  use  of  a  hard-wired  non-volatile  read-only 
memory  which  provides  a  basis  of  correct  programs  for  recovery.  Further 
help  is  offered  if  pure-procedure  instructions  in  this  memory  can  be 
executed  directly,  and  if  these  programs  operate  only  out  of  local 
memory  at  first.  By  working  outward,  valid  portions  of  the  system  begin 
to  emerge.  Also  useful  for  providing  checkpoints  may  be  cheap 
once-writable  memories  (possibly  asynchronous  to  the  main  control). 
(Another  approach  is  to  try  experiments  on  various  configurations  of  the 
whole  system.)  Note  that  this  problem  may  be  intrinsically  insoluble 
for a given system configuration. It may also be insoluble for the
particular massive transient, e.g., because not enough operational
equipment remains to self-diagnose and configure a valid system, or even
just to operate such a system. (Furthermore, more equipment may be
required for diagnosis than for operation.)
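
The bootstrapping solution described above, working outward from the
most critical mechanisms, can be sketched as a simple loop. The level
names and the self-test callables are hypothetical placeholders. The
sketch assumes that each level can be validated using only the levels
already validated beneath it.

```python
def recover(levels):
    """Validate levels from most to least critical; stop at the first failure.

    `levels` is a list of (name, self_test) pairs ordered outward from
    the most critical mechanism. Each self_test() returns True if that
    level checks out. The returned list names the validated levels; an
    empty list means not even the innermost level could be validated.
    """
    validated = []
    for name, self_test in levels:
        if not self_test():
            break                    # outer levels depend on this one
        validated.append(name)
    return validated
```

A fuller version would attempt reconfiguration at the failing level
before giving up; the sketch shows only the ordering of validation.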

3.3. ARCHITECTURES FOR FAULT TOLERANCE

This  section  describes  some  general  architectural  configurations  for 
fault-tolerant  computers.  Our  intention  is  to  show  how  the  variety  of 
techniques  outlined  in  Sections  3.1  and  3.2  may  be  applied  to  the  design 
of  complete  economical  systems  with  high  availability  and  high  degrees 
of  correctness  as  desired.  We  do  not  delve  deeply  into  the  design  of 
particular  systems,  but  rather  merely  attempt  to  justify  their 
fault-tolerance  behavior.  For  each  system  architecture,  we  indicate  an 
estimate  of  overall  redundancy,  reliability,  and  availability  measures. 

We also give methods by which error detection and recovery can be
achieved, and a general assessment of the system. Applications for these
architectures are considered in Chapter 6.

We  examine  various  types  of  system  architectures  here.  Section  3.3.1 
considers  simplex  systems,  that  is,  systems  with  a  single  instruction 
stream,  but  possibly  with  replicated  processors.  Section  3.3.2 
considers  multicomputers  (including  networks)  and  loosely-coupled 
multiprocessor  systems.  Section  3.3.3  considers  strongly-coupled 
multiprocessors, e.g., with sharing of data in memory among processors.
Most of the system types form the basis for systems surveyed in
Appendices 1 and 2, although several types discussed here have not yet
matured into prototype or even paper designs.


Where  fault  tolerance  is  a  design  goal,  it  can  easily  be  incorporated 
into  the  design.  In  general,  however,  it  cannot  be  retrofitted 
effectively  into  an  existing  implementation.  As  indicated  in  Chapter  6, 
suitable  architectures  for  fault  tolerance  exist  for  all  common 
computational  applications.  For  these  applications,  fault  tolerance  can 
be  achieved  by  the  exclusive  use  of  hardware  techniques,  requiring 
little  modification  to  the  operating  system.  However,  if  the  degree  of 
fault tolerance is to be matched to the application needs, and is to
require less redundancy than that associated with replication, then much
more reliance on software is needed. In particular, the operating
system becomes significantly more complex, and perhaps represents a more
likely source of errors than faulty hardware.

The credibility of a particular fault-tolerance concept is of great
concern. In the aerospace environment, the general practice has been to
design extremely simple and crudely replicated systems. This simplicity
is a consequence of the demand for systems that are obviously reliable,
and perhaps amenable to human error detection and reconfiguration. This
demand has precluded the use of the less redundant (although more
complex) fault-tolerance techniques described in this report. We feel
that these better techniques will become more acceptable as the new
technologies emerge, and as operating systems become more reliable.

Advanced fault-tolerant systems, i.e., those with high availability,
fast recovery, and low cost, will place high demands on the operating
system.


3.3.1.  SIMPLEX  SYSTEMS 


In  this  subsection  we  view  a  simplex  processor  system  as  one  in  which 
only  a  single  central  processor  is  present,  or  in  which  all  central 
processors  are  intended  to  operate  with  identical  instruction  streams 
and  data.  The  earliest  conceptions  of  fault-tolerant  systems  were 
simplex  systems,  employing  low-level  redundancy  techniques  (e.g.,  in 
gates  or  registers). 

3.3.1.1. REDUNDANCY ONLY IN MAIN MEMORY

The cost and unreliability of most contemporary systems are largely
dominated by the main memory. (We exclude peripherals from the
immediate discussion, since their effects can be readily decoupled.)
Typically, the main memory is 50% to 75% of the total digital circuitry
in a medium to large system. The main memory can be made reliable by
techniques embodying varying combinations of error detection, error
correction, block replacement, and chip replacement. The use of these
techniques can provide fault tolerance ranging from a minimum of error
detection in memory to completely autonomous error confinement,
reconfiguration, and recovery in response to memory failures. With the
use of these techniques, the memory is from 5% to 35% redundant,
depending on the word length, byte length, and the desired degree of
fault tolerance. There are two possible deficiencies associated with
memory coding and block replacement.


* The memory is prone to faults in external equipment, notably power
supplies. This problem can be alleviated by providing a separate power
supply for each block or for each byte slice of memory.

*  If  only  the  memory  is  protected  by  redundancy,  the  unreliability  of 
the  system  is  decreased  only  by  a  factor  of  about  3.  Hence  some  form  of 
processor  fault  tolerance  is  still  needed. 
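
The factor of about 3 can be read as simple arithmetic. The following is
a minimal sketch, assuming (our assumption, not a statement of the
report) that the failure rate of each part is proportional to its share
of the circuitry and that protected memory contributes negligibly to
system failures. The function name improvement_factor is hypothetical.

```python
def improvement_factor(memory_fraction: float) -> float:
    """Factor by which system unreliability drops when memory faults are masked.

    Assumption (illustrative only): failure rate is proportional to circuit
    count, and coded memory no longer contributes to system failures.
    """
    if not 0.0 <= memory_fraction < 1.0:
        raise ValueError("memory_fraction must be in [0, 1)")
    # Only the unprotected (non-memory) share of the circuitry can still fail.
    return 1.0 / (1.0 - memory_fraction)


if __name__ == "__main__":
    for m in (0.50, 2 / 3, 0.75):
        print(f"memory = {m:.0%} of circuitry -> factor {improvement_factor(m):.1f}")
```

Under these assumptions, a memory share between 50% and 75% gives a
factor between 2 and 4, which brackets the figure quoted above.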

Memory fault protection is rapidly becoming a common practice. Most
machines have a parity check option on main memory, and some newer
machines (e.g., IBM's System/370) incorporate error correction at the
bit level or at the byte level. Most machines with relocation hardware
are capable of reconfiguration around one or more faulty memory blocks.
This is a primitive form of graceful degradation in that main memory
functions are either lost or taken over by secondary or paging memories,
with an accompanying reduction in performance. However, we know of no
working machines that achieve reconfiguration autonomously, subsequent
to a detected error. Such reconfiguration is not difficult to achieve,
and can extend the up-time of a system enormously.
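
Autonomous reconfiguration around a faulty memory block can be
illustrated with a small model of relocation hardware. This is a sketch
under our own assumptions, not a description of any machine surveyed
here: the class RelocationMap, its fields, and the policy of falling back
to paging memory when no spare block remains are all hypothetical.

```python
class RelocationMap:
    """Maps logical block numbers to physical memory blocks (hypothetical model)."""

    def __init__(self, n_logical: int, n_physical: int) -> None:
        if n_physical < n_logical:
            raise ValueError("need at least one physical block per logical block")
        self.mapping = {i: i for i in range(n_logical)}
        self.spares = list(range(n_logical, n_physical))
        self.paged_out = set()  # logical blocks now served by paging memory

    def on_detected_error(self, physical_block: int) -> None:
        """Retire a faulty physical block without operator intervention."""
        for logical, phys in self.mapping.items():
            if phys == physical_block:
                if self.spares:
                    # Route the logical block to a spare physical block.
                    self.mapping[logical] = self.spares.pop(0)
                else:
                    # No spare left: degrade gracefully to slower memory.
                    self.mapping[logical] = None
                    self.paged_out.add(logical)
                return
        if physical_block in self.spares:
            self.spares.remove(physical_block)  # a faulty spare is dropped
```

With four logical blocks and five physical blocks, the first detected
error is absorbed by the spare with no loss of main memory; a second
error moves one logical block to paging memory, with the accompanying
reduction in performance.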

3.3.1.2. REDUNDANCY IN MAIN MEMORY WITH PROCESSOR REPLICATION

The  simplest  approach  to  tolerating  faults  in  processors  is  to  use 
replicated  processors  and  provide  some  mechanism  for  resolving 
discrepancies among their outputs. In one mode the processors are
duplicated and the two instruction streams are synchronously compared
before being accepted as correct. Any discrepancy can trigger a
single-instruction retry in the hope that the fault causing the
discrepancy is transient. If the retry fails the first time, a more
complex  recovery  may  be  invoked.  Finally,  if  all  retry  attempts  fail 
because  the  fault  is  permanent,  autonomous  or  human  diagnosis  can  be 
undertaken  to  identify  the  faulty  processor.  The  system  is  then 
reconfigured  to  use  just  the  good  processor.  In  any  event  the 
duplicated  processor  scheme  can  clearly  prevent  an  incorrect  result  due 
to  a  single  faulty  processor.  With  the  inclusion  of  some  diagnostic 
procedures  it  can  provide  a  system  that  remains  available  in  the 
presence  of  one  faulty  processor.  Such  a  system,  including  the  cost  of 
memory  coding,  may  have  from  33%  to  45%  redundancy,  depending  on  the 
dominance  of  memory  in  the  system.  Besides  its  relatively  high 
redundancy,  this  approach  has  two  operational  deficiencies. 

*  Inadequacies  in  the  current  diagnosis  practice  preclude  the  use  of 
this  approach  in  the  most  exacting  fault-tolerance  situations.  That  is, 
most  diagnostic  programs  are  successful  in  handling  no  more  than  90%  of 
the  fault  possibilities.  Thus  subsequent  to  a  processor  failure,  the 
system  may  not  be  successfully  reconfigured  as  much  as  10%  of  the  time. 

*  The  comparison  of  processor  outputs,  if  carried  out  in  hardware, 
introduces  a  few  extra  gate  delays.  In  high-speed  applications  it  might 
be  possible  to  pipeline  this  comparison  with  other  operations  at  the 
expense  of  extra  circuitry. 
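
The duplicated-processor mode described above (compare, retry, then
diagnose and reconfigure) can be sketched as follows. This is an
illustration only: processors are modeled as plain functions, and the
names duplex_execute, diagnose, and max_retries are our own. The
diagnose hook may return None, which models the imperfect coverage of
diagnostic programs noted in the first deficiency.

```python
from typing import Callable, Optional, Tuple

Processor = Callable[[int], int]


def duplex_execute(
    a: Processor,
    b: Processor,
    operand: int,
    diagnose: Callable[[Processor, Processor], Optional[Processor]],
    max_retries: int = 1,
) -> Tuple[int, Optional[Processor]]:
    """Run one step on both processors and accept it only if they agree.

    Returns (result, survivor). survivor is None while both units stay in
    service, or the single good processor after a reconfiguration.
    """
    for _ in range(1 + max_retries):      # first attempt plus retries
        result_a, result_b = a(operand), b(operand)
        if result_a == result_b:          # agreement: accept the result
            return result_a, None
    # Every attempt disagreed, so the fault is presumed permanent.
    faulty = diagnose(a, b)               # may return None (imperfect coverage)
    if faulty is None:
        raise RuntimeError("diagnosis could not identify the faulty processor")
    survivor = b if faulty is a else a
    return survivor(operand), survivor
```

A transient fault is absorbed by the retry and leaves both processors in
service; a permanent fault leads to operation on the surviving processor
alone, after which no further comparison is possible.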

If  a  higher  probability  of  successful  autonomous  response  to  an  error  is 
required,  then  a  triplicated  or  higher-order  replicated  processor  can  be 
used.  The  processors  can  then  be  operated  in  a  voting  mode  or  a  dynamic 
voting  mode  if  there  are  more  than  three  processors.  The  processor  and 
memory  are  approximately  in  balance  if  the  memory  operates  with 
single-error  correction  and  single  block  replacement  and  if  the 
processor  is  triplicated.  In  this  case  the  probabilities  that  the 
memory  or  processor  exhaust  their  respective  resources  are  roughly 
identical.  Recovery  in  the  case  of  a  triplicated  processor  should  still 
include  a  single  instruction  retry,  before  trying  to  restart  from  an 
earlier  state,  or  before  discarding  the  disagreeing  processor.  The 
major drawback of grossly replicating the processor is cost.
Triplication of processors with error-correcting coding in memory can
have redundancy as high as 60%.
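
The voting mode for replicated processors can be sketched as a majority
vote over the processor outputs. This is a minimal illustration under
our own naming (majority_vote is hypothetical); it does not model the
single-instruction retry or the restart from an earlier state.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def majority_vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Return (value, dissenters) for one set of replicated outputs.

    value is the strict-majority output, or None if no majority exists.
    dissenters lists the indices of processors that disagreed with it.
    """
    if not outputs:
        raise ValueError("at least one output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):         # no strict majority: vote fails
        return None, list(range(len(outputs)))
    return value, [i for i, out in enumerate(outputs) if out != value]
```

For three processors, majority_vote([7, 7, 9]) accepts 7 and names
processor 2 as the dissenter. With more than three processors, the same
function can be reapplied to the remaining units after a dissenter is
discarded, which is one simple reading of a dynamic voting mode.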

3.3.1.3. TRIPLICATED SYSTEMS

We  discuss  above  the  deficiencies  due  to  the  use  of  different  redundancy 
techniques  for  the  processor  and  memory.  Another  deficiency  for  certain 
applications  is  the  need  to  modify  the  basic  computer  design.  One 
simple  way  of  avoiding  these  difficulties  is  to  operate  multiple 
computers  (including  their  memory)  in  a  duplicated  or  triplicated  mode. 
The  results  are  compared  whenever  information  leaves  a  computer,  e.g., 
to  a  channel.  The  comparison  in  this  case  is  done  at  such  a  low  duty 
cycle  that  software  voting  may  be  feasible.  When  a  disagreement  is 
detected, the backtrack can be to the beginning of a computation or to
the last channel invocation. In this case the need for saving register
states in order to achieve single instruction retry is avoided. Of
course, the main drawback of a uniformly triplicated system is its
redundancy, which exceeds 67%.
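
The redundancy figures quoted in this section can be related by a simple
calculation. The sketch below assumes (our assumption) that redundancy
means the fraction of total hardware beyond that of a nonredundant
system, and it ignores voters, comparators, and switches. It is not
intended to reproduce the figures of the report exactly; the function
and parameter names are hypothetical.

```python
def replication_redundancy(copies: int) -> float:
    """Redundant fraction of hardware when the whole system is replicated."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return (copies - 1) / copies


def mixed_redundancy(memory_fraction: float, memory_overhead: float,
                     processor_copies: int) -> float:
    """Redundant fraction with coded memory and a replicated processor.

    memory_fraction:  share of the nonredundant system that is memory.
    memory_overhead:  extra memory hardware (check bits, spare blocks)
                      as a fraction of the nonredundant memory.
    """
    total = (memory_fraction * (1.0 + memory_overhead)
             + (1.0 - memory_fraction) * processor_copies)
    return 1.0 - 1.0 / total


if __name__ == "__main__":
    print(f"triplicated system: {replication_redundancy(3):.1%}")
    print(f"duplicated processor, coded memory: {mixed_redundancy(0.5, 0.2, 2):.1%}")
```

Uniform triplication gives two parts in three before any voting hardware
is counted, which is consistent with a total that exceeds 67%.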

This single replicated virtual processor concept was used in the Saturn
V guidance computer. It is a possible mode of operation in a version of
SIFT (Wensley A2) which is stripped down to exclude multiprocessing, and
forms the basis for a flight-control computer under consideration by
NASA-Langley.

3.3.1.4. REDUNDANCY APPLIED OVER PROCESSOR PARTITIONS

Among  the  major  drawbacks  of  the  triplicated  processor  scheme  and  the 
triplicated  system  scheme  are  the  following. 

*  After  a  single  processor  failure,  all  spare  processor  resources  are 
exhausted. 

* The crude redundancy technique does not take advantage of the unique
structure of particular processor sub-blocks. Thus the redundancy is
higher than it needs to be.


Under certain circumstances, an attractive scheme is to decompose a
processor into sub-blocks and to apply redundancy techniques appropriate
to each sub-block. For example, in the STAR computer (Self-Testing and
Repair, Avizienis A2) the following sub-blocks are identified:
arithmetic unit, logic unit, and control unit. For a more powerful
computer than the STAR, one could also include stacks, expanded register
sets, and scratch-pad memories, for example. The basic fault-tolerance
method is to detect an error at an interface to one of these sub-blocks,
and if necessary, to replace that sub-block with a spare. Residue codes
are used for error detection in the arithmetic unit (and in the memory),
a 2-out-of-4 code for instructions, and duplication elsewhere. In all
approaches of this type, there is the need for some overall executive
within the processor to act as the ultimate arbiter of all detected
errors. In the STAR, the TARP (Test and Repair Processor) serves this
function, and is itself triplicated. Note that the TARP really serves
as a "smart" bus with all inter-block transfers passing through it.
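
The two coding techniques named above can be illustrated briefly. The
sketch below shows a residue check on addition and a validity test for a
2-out-of-4 code. It is an illustration of the general techniques, not of
the STAR design: the check modulus of 3, the function names, and the
injectable adder used to simulate a fault are all our assumptions.

```python
from typing import Callable, Tuple

MODULUS = 3  # hypothetical check modulus, chosen only for illustration

Coded = Tuple[int, int]  # (value, residue of value modulo MODULUS)


def encode(value: int) -> Coded:
    """Attach a residue check symbol to an operand."""
    return value, value % MODULUS


def checked_add(a: Coded, b: Coded,
                adder: Callable[[int, int], int] = lambda x, y: x + y) -> Coded:
    """Add two coded operands and verify the sum against the residues.

    The residue of the sum is predicted from the operand residues alone,
    independently of the (possibly faulty) main adder.
    """
    (xa, ra), (xb, rb) = a, b
    total = adder(xa, xb)
    expected = (ra + rb) % MODULUS
    if total % MODULUS != expected:
        raise ArithmeticError("residue check failed in arithmetic unit")
    return total, expected


def is_two_out_of_four(word: int) -> bool:
    """A 4-bit word is a valid codeword only if exactly two bits are set."""
    return 0 <= word < 16 and bin(word).count("1") == 2
```

An adder fault that changes the sum by 1 is detected, whereas a fault
that changes it by a multiple of the check modulus escapes detection.
Any single-bit error in a 2-out-of-4 codeword changes the number of ones
and is therefore detected.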

The  system  can  be  as  low  as  40%  redundant  with  a  spare  for  each 
sub-block.  Moreover  the  up-time  can  be  extended  by  a  factor  of  10, 
because  of  the  partitioning  of  the  processor.  One  of  the  most 
compelling  advantages  is  that  no  radical  change  is  needed  in  the 
functional  partitioning  of  the  system.  The  major  deficiencies  of  this 
approach  are  the  following. 

* Major internal processor delays may be encountered due to the TARP, a
situation that might be alleviated by pipelining its operation with
other units.

* In an LSI implementation the partitioning might not be appropriate.
This is particularly true if the computational requirements can be met
by a one- or two-chip computer. However, in much larger installations
MSI is likely to be used in the near future. With an MSI
implementation, a relatively fine partitioning is feasible.

An early version of the SERF computer of Raytheon (Stiffler 73) employs
a partitioning similar to STAR. However, the arithmetic-logic unit is
decomposed into bytes while external switching can provide a routing
around faulty bytes. A partitioning this fine is appropriate only where
effectively zero maintenance is required for long periods of time. The
MECRA computer (Delamare A2) uses a variety of coding techniques within
the processor. In addition, graceful degradation is achieved by program
modification, using microprograms that remain correct.

Our conclusion with regard to partitioned processors is that technology
advances have precluded their applicability for their originally
intended application, aerospace. However, they appear useful for large
processor installations, provided the delay problems can be solved.

3.3.1.5. MICROPROGRAM-ORIENTED PROCESSORS

Many small to medium size processors achieve a rich instruction set by
microprogramming. As the availability of high-speed memories increases,
it is likely that microprogramming will appear in all but the super-fast
computers. Microprogramming is used to realize many complex
instructions that otherwise would require special hardware. Thus the
instruction unit can be simplified and many special logic boxes (e.g.,
floating-point hardware, multipliers, interrupt handlers) can be
eliminated. The net result is a total computer in which only about 10%
of the hardware is not a memory function. Straightforward coding
techniques can be used for error detection or correction. In addition,
with writable control store, the microprograms can be paged and routed
from a failed memory block to an operative one, or in an extreme case,
to a memory block in slower memory. Crude redundancy techniques can be
used for the non-memory hardware, at comparatively little incremental
cost.


3.3.1.6. PROCESSORS WITH DEFERRED, PARTIAL, OR PROBABILISTIC DETECTION

Most of the architectures discussed above aim at correctness for all
computations and availability in the presence of single faults.
However, in many applications correctness is essential only for certain
computations, such as those involving security and file protection
(including address generation). We describe here a system wherein
single hardware faults cannot result in a security breach. It would be
preferable that the machine halt rather than propagate an error critical
to its security. To achieve this safeguard, certain equipment (notably
base, mapping, and relocation registers) must be protected against
faults. Error-detecting codes can help here. In addition, the
functions that read or modify access rights must be checked. This
requirement can be accomplished by consistency checks or duplication in
space or time. A more reliable but less elegant solution is to provide
a special replicated hardware unit within the processor that would
execute primarily those functions within the security perimeter. If
this unit can be designed so as to consume a small fraction of the
computational resources in an integrated non-fault-tolerant
implementation, then a replicated minicomputer within a large processor
might suffice to achieve fault tolerance.

The detection of only those errors that are in some sense critical is a
special case of deferred detection (Section 3.1.1.3). Short of coding
or duplication, many features can be included in a processor to enhance
detection. For example, a tagged (or descriptor-based) architecture
(e.g., Feustel 73, or the Burroughs B5500) can be a valuable tool for
detecting hardware errors. Any error that leads to a
type violation (e.g., execution of data, or adding a floating-point
number to a Boolean value, or an attempt to manipulate a capability)
could be detected. The central problem with this technique is to
protect the hardware that manipulates the tags.
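
The way a tagged architecture converts certain hardware errors into
detectable type violations can be sketched as follows. This is a toy
model under our own assumptions, not a description of the B5500 or of
the architecture of Feustel: the tag set, the Word type, and the
operations add and fetch_instruction are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    INT = "int"
    FLOAT = "float"
    BOOL = "bool"
    INSTRUCTION = "instruction"
    CAPABILITY = "capability"


@dataclass(frozen=True)
class Word:
    tag: Tag          # type tag carried with every memory word
    value: object


class TagViolation(Exception):
    """Raised when an operation meets an operand of the wrong type."""


def add(a: Word, b: Word) -> Word:
    """Arithmetic is legal only on two numeric operands of the same tag."""
    if a.tag is not b.tag or a.tag not in (Tag.INT, Tag.FLOAT):
        raise TagViolation(f"cannot add {a.tag.value} to {b.tag.value}")
    return Word(a.tag, a.value + b.value)


def fetch_instruction(word: Word) -> object:
    """The instruction unit refuses to execute anything not tagged as code."""
    if word.tag is not Tag.INSTRUCTION:
        raise TagViolation("attempt to execute data")
    return word.value
```

If an addressing fault causes the processor to fetch a data word as an
instruction, or to combine a floating-point number with a Boolean value,
the check fails and the error is caught before it propagates. The model
also makes the central problem visible: a fault in the tag field or in
the checking logic itself would defeat the scheme.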


3.3.1.7. CONCLUDING REMARKS ON SIMPLEX PROCESSOR ARCHITECTURES

As  noted  in  Chapter  6,  simplex  architectures  are  the  most  prevalent 
today.  They  will  probably  remain  common  in  the  future,  at  least  in 
super-fast  systems.  Efficient  methods  exist  for  designing 
fault-tolerant  simplex  systems.  For  example,  a  system  that  is  40%  to 
50%  redundant  can  be  correct  and  available  in  the  presence  of  single 
faults.  In  a  multiprocessor  organization,  this  redundancy  can  be 
reduced by a factor of at least 2 for applications in which most
computations need not be carried out reliably, provided certain critical
computations are reliable. That is, the multiprocessor organization
discussed below is better matched to applying redundancy nonuniformly in
time.


3.3.2.  LOOSELY-COUPLED  MULTIPROCESSOR  ORGANIZATIONS 

In  this  section  we  describe  several  multiprocessor  architectures  that 
exhibit  economical  fault  tolerance.  We  assume  applications  for  which 
the  various  tasks  run  substantially  independent  of  each  other,  in 
separate memory blocks. As noted below, the absence of sharing and/or
strong interprocessor communications greatly simplifies the design of
such fault-tolerant multiprocessors. Multiprocessors with strong
dependence among processors (e.g., with shared use of memory) are
considered in Section 3.3.3.

It  is  clear  that  multiple  processors  are  effective  for  fault  tolerance, 
for  at  least  the  following  reasons: 

*  Processors  and  memory  blocks  represent  good  replacement  units. 

* When all resources (processors, memories, etc.) are operative, they are
all kept busy doing useful work. As resources fail, the operative ones
take up the slack with a loss in performance. Thus, the long-standing
goal of a gracefully degraded system is readily achieved with a
multiprocessor, except for the detection and diagnosis problems.

*  It  is  possible,  in  principle,  to  achieve  redundancy  that  is  variable 
in  time  and  space.  For  certain  critical  computations,  several 
processor-memory  pairs  can  operate  in  a  replicated  mode.  Moreover,  this 
replication  can  be  modified  dynamically  in  time. 

We divide multiprocessor organizations into three types: fixed
multicomputer systems, configurable multi-computer systems, and
multiprocessors with common memory. (Note again that shared memory is
discussed in Section 3.3.3.) Included are systems in which there is an
active processor and a monitor processor, and networks of systems. We
discuss fault-tolerance techniques available for these organizations.

These designs have not generally been suitable for efficient operation
of a large-scale general-purpose interactive computing system (e.g., a
computer utility). Most such designs have been suggested for an
aerospace environment. The main reason for the unsuitability of these
designs to such applications is that hardware is not present to support
sharing or flexible communication between error-prone processes
executing in different processors or memories. Several of the designs
permit interprocess communication, provided the processors all operate
in a replicated mode. In a trivial sense the system then is protected
against security breaches caused by single faults, but we do not
consider this to be a satisfactory solution for, say, a computer
utility. A more desirable solution is outlined below, in which
replication is avoided. The omission in this subsection of hardware to
support reliably controlled sharing is intentional. If the mechanisms
for sharing are nonexistent or severely restricted, then a process going
awry because of a hardware fault cannot cause damage outside its
restricted domain. A satisfactory solution to the sharing problem in
the presence of hardware faults does not exist, but the architectural
configurations discussed in Section 3.3.3 seem to be a step in the
right direction.

3.3.2.1. FIXED MULTICOMPUTERS

The primitive element of a multicomputer is a processor/memory
combination. In such an architecture, the system can be protected
against a processor going awry by enforcing an intercomputer security
discipline. Moreover, since the primitive element is essentially a
self-contained computer system, there is limited need for communication
among the computers, except for the purpose of handling error
conditions or message handling involving the executive. An example of a
fixed multicomputer is the Pacific Coast Stock Exchange system COMEX
(Wallace A2). Multicomputer configurations also include computer
networks (e.g., the ARPA Network); see Kuo and Abramson (73). The
desired fault tolerance for networks is highly dependent on the
component systems, on the interconnection network, and on the
applications. Networks are discussed in Section 6.3.

With  the  help  of  a  flexible  switching  network  between  I/O  devices  and 
the  set  of  computers,  jobs  can  be  assigned  to  an  available  computer. 
There is no facility for one computer to write in the memory of another.
For example, the protection against the erroneous overwriting of a disk
file is enforced by permitting only an executive to modify the switching
network. The executive is run independently in each of two computers so
that its operations are checked. Errors are detected either by a
disagreement among executive computers or by any self-checks
incorporated within the individual computers running application
programs. Any of the self-checks discussed for a simplex processor
system could suffice here. The executive operating in a checked mode
could  diagnose  a  suspected  computer.  Note  that  this  executive  is 
running  at  an  extremely  low  duty  cycle,  performing  only  job  scheduling 
and  infrequent  error  control.  Each  computer  will  have  a  resident 
operating  system  to  perform  such  operations  as  loading  and  subroutine 
linkage.  The  redundancy  of  this  concept  is  quite  low  (not  exceeding 
10%)  as  measured  by  the  amount  of  hardware  and  software  devoted  to  fault 
tolerance. 

A  minor  augmentation  of  the  technique  could  provide  for  the  checking  of 
the  application  programs  if  desired  by  the  user.  In  this  case  the 
application  programs  are  run  in  two  or  more  computers.  The  local 
executive  resident  in  each  computer  (pertinent  for  this  application 
program)  periodically  reads  the  results  for  this  program  computed  by 
other  computers.  Any  disagreements  can  be  noted  in  the  memories  for 
future  disposition  by  the  system  executive.  Periodically,  the 
processors  read  from  specified  locations  in  the  executive  computer's 
memory  to  determine  if  they  should  handle  new  jobs,  become  an  executive 
computer, or possibly disconnect themselves. The error control protocol
discussed above is a simplified description of the SIFT system (Wensley
72). The ARMMS system (Martin A2) is also a multicomputer concept, but
incorporates all executive functions within an especially smart
interface unit.
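
The checking of application programs described above can be sketched as
follows. Each computer writes results only into its own memory, reads
the results of its peers, and notes disagreements for later disposition
by the system executive. This is a simplified illustration under our own
assumptions and is not the SIFT protocol itself: the class Computer, the
method names, and the rule by which the executive names a suspect are
hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


class Computer:
    """One processor/memory pair; peers may read its memory but not write it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.results: Dict[str, int] = {}
        self.disagreements: List[Tuple[str, str]] = []  # (task, peer name)

    def post(self, task: str, value: int) -> None:
        self.results[task] = value          # writes go to own memory only

    def check_peers(self, task: str, peers: Sequence["Computer"]) -> None:
        mine = self.results[task]
        for peer in peers:
            theirs = peer.results.get(task)  # read-only access to peer memory
            if theirs is not None and theirs != mine:
                self.disagreements.append((task, peer.name))


def executive_suspects(computers: Sequence[Computer], task: str) -> List[str]:
    """Name any computer accused by a majority of its peers for this task."""
    votes = Counter(name for c in computers
                    for (t, name) in c.disagreements if t == task)
    threshold = (len(computers) - 1) / 2
    return [name for name, n in votes.items() if n > threshold]
```

With three computers running the same task and one of them faulty, the
two good computers each note a disagreement with the faulty one, while
the faulty one notes a disagreement with each good one. Only the faulty
computer is accused by a majority of its peers.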

The multicomputer approach is clean, and should find application in
environments where the computer system is a relatively small portion of
the total mission cost, and the application program demands are known to
be near constant. However, for other applications the notable
disadvantages of the scheme are:

* Because there is little intercommunication among processors, each
disposable unit (processor and memory) must be fairly large in order to
represent an independently viable computer. Thus, it represents a large
unit to be removed upon failure. The configurable multi-computer
discussed in Section 3.3.2.2 represents a finer and more realistic
partitioning.

* Assuming that the individual computers are larger than minicomputers,
then multiprogramming within a single computer is desirable
if  the  computers  are  to  achieve  reasonable  efficiency.  However,  there 
is  a  problem  of  maintaining  isolation  between  the  processes  being 
multiprogrammed.  In  the  presence  of  faults,  such  isolation  can  be 
achieved  only  by  using  the  relatively  expensive  techniques  of  a 
replicated  simplex  system  discussed  in  Section  3.3.1. 

*  The  system  is  too  inflexible  for  variable  tasks.  For  example,  there 
is  no  way  to  vary  the  high-speed  memory  allocated  to  a  job. 

3.3.2.2. CONFIGURABLE MULTI-COMPUTERS

We  consider  architectures  in  which  a  set  of  computers  can  be  configured 
out  of  a  collection  of  processors,  memories,  and  (possibly)  I/O 
controllers.  The  configuration  is  accomplished  either  manually  by  an 
operator  or  by  an  executive  (in  hardware  and/or  software).  To 
accomplish  such  variable  interconnections  among  system  units,  the  system 
requires  an  interconnection  network  (e.g.,  a  cross-bar  or  restricted 
cross-bar) between a set of memories and a set of processors, and
another such network between the I/O controllers and the memories.
(Some switched communication links will also be required between the
processor and I/O.) To inhibit deleterious error propagation from a
failed processor, the interconnection network is changed only
infrequently, e.g., when a new job is loaded in, or possibly only when a
unit fails.

The fault tolerance procedures for a configurable multicomputer are
almost equivalent to those of the architectures discussed in Section
3.3.2.1. For example, the CLC computer of Bell Laboratories (see
Ridgway A2) uses a variety of consistency checks to detect errors. The
PRIME system (Borgerson A2) relies on memory parity, periodic diagnosis,
and user complaints to detect errors. There is no attempt to perform
error correction on the above systems, so that the main forte of these
systems is availability.

The redundancy is slightly higher than that for a fixed multi-computer
architecture, mainly because of the need for extra hardware in the
interconnection networks, and extra software to implement the more
advanced reconfiguration possibilities. However, aside from the cost of
spare units, the system should not be more than 15% redundant. The
system is somewhat prone to faults in the switching network.
Nevertheless, the effects of a single fault in the switching network can
be made equivalent to a fault in a processor, memory, or I/O controller
by distributing the switches among the units. If there is a need for
certain computations to run concurrently in two or more computers, such
an allocation can be effected by the executive. At the conclusion of
the computation, the executive can gain access to the pertinent memory
modules to compare the results.

The performance of a configurable multiprocessor is better than that of
the fixed multicomputer in several respects.

* The configurable multicomputer offers longer life for a given number
of spares, because of the finer partitioning. That is, when an error is
discovered, a subsequent diagnosis can pinpoint the fault to a memory or
processor unit. In the fixed multicomputer, an entire processor/memory
pair is discarded.

*  The  configurable  multicomputer  offers  the  possibility  of  adjusting  the 
main  memory  requirements  to  the  needs  of  a  job. 

* By virtue of the interconnection networks, there is the possibility
for some interprocessor communications. However, the reliability needs
dictate that this communication should be under the strict control of
the executive.
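
The first advantage, longer life for a given number of spares, can be
illustrated by counting the failures that each organization survives.
The sketch below is a deliberately crude model under our own
assumptions: each failure is labeled as striking a processor or a
memory, diagnosis is assumed perfect, and the function names are
hypothetical.

```python
from typing import Sequence


def survived_fixed(failures: Sequence[str], spare_pairs: int) -> int:
    """Fixed multicomputer: every failure costs a whole processor/memory pair."""
    return min(len(failures), spare_pairs)


def survived_configurable(failures: Sequence[str],
                          spare_processors: int, spare_memories: int) -> int:
    """Configurable multicomputer: only the failed unit type is replaced."""
    spares = {"processor": spare_processors, "memory": spare_memories}
    survived = 0
    for unit in failures:
        if spares[unit] == 0:   # no spare of this type remains
            break
        spares[unit] -= 1
        survived += 1
    return survived
```

With the same spare hardware (two spare processors and two spare
memories, either paired or pooled) and the failure sequence processor,
memory, processor, memory, the fixed organization survives two failures
and the configurable organization survives four. When all failures
strike the same unit type, the two organizations fare equally.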

Despite  the  above  advantages,  a  configurable  multiprocessor  with  the 
present  state  of  the  art  does  not  meet  the  requirements  of  many  computer 
utilities.  This  is  true  primarily  because  of  the  difficulties  of 
achieving  multiprogramming  within  each  processor,  and  of  achieving 
reliably  controlled  sharing  of  memory  among  processors. 

3.3.2.3. LOOSELY-COUPLED MULTIPROCESSORS WITH COMMON MEMORY

For  applications  in  which  most  of  the  computations  must  be  fault 
tolerant,  and  in  which  there  are  real-time  constraints  on  the 
computations,  the  several  multicomputer  architectures  discussed  above 
are grossly redundant. That is, the aforementioned multicomputers
require that the computation be executed in two or more full computers.
This  fault-tolerance  procedure  does  not  take  advantage  of  the  low-cost 
coding  techniques  for  memory. 

Memory coding techniques can be used for both error detection and
correction, as follows. The main memory is either a large
block-oriented common memory or a multiport memory that can communicate
with other system units by means of an interconnection network. Each
processor unit is a pair of processors that will operate in a lock-step
mode. Processor errors are detected by a disagreement between the
processor outputs. To ensure that erroneous information does not
emanate from the processor pairs, it is necessary to suspend the
operation of the pairs when the error is detected. This can be achieved
by some reliable logic (usually triplication) at the interfaces to
processor pairs and other system blocks. This approach has been taken
in the Hopkins multiprocessor (Hopkins A2) and the Intermetrics
multiprocessor (Miller A2). Another approach is to make the
interconnection network powerful enough to isolate a processor pair in
error. This approach is pursued in the EUCS system (Wensley et al. 73).
In either case, a processor pair is discarded if the fault is permanent.

As mentioned above, system memory can take the form of a common memory
or of a set of memory modules. In either case the memory information is
protected with coding that provides at least single-byte error
correction. When an error is detected in memory, the block or module in
error is kept in service long enough to transfer its data to an
operative section.

This concept is less costly than the multicomputer structures if
correctness of results is important. The actual cost varies with the
number of units needed to meet the computational needs, but typically
the system will be about 50% redundant with one spare unit of each type.
Moreover, these concepts can be extended to allow process sharing, since
each processor's operation can be checked. However, this checking still
requires duplication of all processors, a cost that is not attractive
for general use.

A common use of a multiprocessor configuration is one in which one
processor checks on the performance of another or does background work,
but is prepared to take over active performance. Such systems include
No. 1 ESS (Ulrich A2) (with a monitor processor running diagnostics),
and the New York Stock Exchange Market Data System MDS-2 (with two
processors multiprocessing and a third acting as a monitor).

3.3.3.  STRONGLY-COUPLED  MULTIPROCESSORS 


The multiprocessor systems discussed in 3.3.2 are primarily intended for
the aerospace environment, or an environment in which processes can
function independently. In the latter case, the multiprocessor
structure offers high availability. Sharing is possible only if all
computations can be guaranteed to operate error-free, which in turn
generally requires costly replication for those multiprocessors.

However, in a modern computer utility, controlled sharing is extremely
desirable. Moreover, it should not be necessary to replicate entire
processors in order to achieve reliably controlled sharing when the
programs desiring sharing need not be error-free.

A useful example is provided by the Multics system. Among the important
currently implemented features of Multics that bear on sharing and fault
tolerance are the following:

*  The  ring  structure  (within  a  process)  prevents  a  program  (running  in 
some  ring)  from  disturbing  a  program  that  runs  in  some  inner  ring.  In 
particular,  an  application  program  cannot  crash  the  operating  system. 

* The operating system itself is layered, with the security-dependent
functions clustered in the innermost ring. At present that ring is too
large for our purposes, an issue considered below.

*  The  file  system  is  fairly  immune  to  system  crashes. 

* Processors or memory modules can be added or deleted while the system
is in operation.

Aside from the third item, these features are also included in the
design for the SUE system (Sevcik et al. 72). Multics does little to
support fault tolerance (e.g., there is at present no instantaneous
attempt to recover from a parity error in memory), although there are
substantial mechanisms for the integrity of resident storage. Under
hardware faults, the only guarantee is that the system can eventually
recover, with or without operator intervention.

The desired characteristics for a system embodying both sharing and
advanced fault tolerance are the following:

* Sharing and protection are desirable, in the spirit of Multics.

* The protection mechanisms should not be violated under single fault
occurrences.

* Processes should be able to execute on an unreplicated processor and
still enable the protection mechanisms to be maintained.

* If correctness is needed for certain computations, then such
computations should execute in a replicated mode.

* The individual processors should be multiprogrammable.

*  Availability  should  be  achievable  by  the  inclusion  of  spare  modules. 

The Plessey 250 system (Williams A2) comes close to meeting the above
characteristics. It is a multiprocessor structure with special hardware
within a processor to support a capability-oriented protection scheme.
Any process can invoke the operating system, so that the operating
system is a part of any process on any processor. The detection of
errors and the prevention of error propagation beyond a processor are
achieved by a combination of consistency checks and special self-tests
within a processor. For example, a process accessing a segment for
which the capability does not exist would cause an error indication.

For the most part the Plessey 250 system operates in a benign
environment, so that the capability checks are present mainly for error
detection and confinement rather than for bootstrap recovery. A
well-designed hierarchical recovery procedure is provided. The system
is quite economical (less than 25% redundant), and the error detection
and recovery procedures have been evaluated by simulation. However, the
system still relies primarily on ad hoc error detection procedures. If
these design techniques are applied to a less predictable computing
environment, there is no assurance that errors will be caught before
they cause a crash or a security breach, nor is there any assurance that
the recovery can be carried out autonomously.

It is possible to achieve all of the goals by performing certain
operating system functions in a replicated mode. The Carnegie-Mellon
C.mmp multiprocessor system (Siewiorek A2) offers some possibility in
this direction. Briefly, C.mmp is a multiprocessor in which a set
of processors communicate with a set of memory modules via a
crossbar-type network. Certain interconnections can be inhibited by
manual control of the network. Aside from this manual override, the
crossbar is settable by a block address generated by a processor. The
contents of a set of mapping registers associated with each processor
determine the capabilities of the process running in the processor in
question. These registers can be set only by the operating system.

The most significant aspect of the operating system is its kernel,
called Hydra (Siewiorek A2). Within its boundaries Hydra contains
sufficient routines to enforce various protection and sharing
disciplines among processes. Hydra also offers facilities for writing
an extended operating system. Any process can invoke Hydra on its
behalf. From a fault tolerance standpoint, all of the features
presently in Hydra should be protected. That is, hardware faults should
not induce any errors in the operation of the kernel. In addition to
the current functions of Hydra, the reliability kernel should contain
procedures for recovery, diagnosis, and configuration. Much of I/O does
not belong in the reliability kernel, except possibly a disc manager. It
is intended that the reliability kernel be run in a checked mode. The
most convenient way of achieving this checking is to run the reliability
kernel, when it is called, simultaneously on two processors. If the
memory modules incorporate their own fault tolerance (probably by means
of error correcting codes), two distinct memories are not generally
required. However, since the temporary storage memory requirements of
the kernel are small, each replicate of the kernel can run
simultaneously in its own processor and memory. At a time when the
kernel can return values, the process calling the kernel can read the
results simultaneously from both memories, and can compare the results.
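
The duplicated execution and comparison just described can be modeled
in a few lines. This is only an illustrative sketch under stated
assumptions: the names run_checked, KernelMismatch, healthy_cpu,
faulty_cpu, and allocate_pages are hypothetical, processors are modeled
as Python callables, and the comparison step itself is assumed to be
reliable.

```python
class KernelMismatch(Exception):
    """Raised when the two replicas of a kernel call disagree."""


def healthy_cpu(kernel_call, args):
    # Hypothetical model of a fault-free processor: just run the call.
    return kernel_call(*args)


def faulty_cpu(kernel_call, args):
    # Hypothetical model of a faulty processor: flip the low-order result bit.
    return kernel_call(*args) ^ 1


def run_checked(kernel_call, args, cpu_a, cpu_b):
    """Run one kernel call on two processors and compare the results."""
    result_a = cpu_a(kernel_call, args)
    result_b = cpu_b(kernel_call, args)
    if result_a != result_b:
        # Disagreement only detects the fault; recovery is a separate step.
        raise KernelMismatch((result_a, result_b))
    return result_a


def allocate_pages(base, count):
    # Stand-in for a kernel service; any deterministic function would do.
    return base + count
```

A call such as run_checked(allocate_pages, (100, 4), healthy_cpu,
healthy_cpu) returns the agreed value, while substituting faulty_cpu for
either replica raises KernelMismatch instead of returning a wrong
result.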

A minor hardware augmentation of C.mmp is required here. If this
comparison and the resultant storage of the kernel results are to be
carried out reliably, then these operations themselves must be checked.
One way of achieving this reliable abstraction of the kernel's
computations is to expand the capability register set associated with
each processor into a small duplicated microprocessor. The register set
need not be duplicated, but can be protected by a simple parity code.
This duplicated microprocessor (distributed among the processors) can be
viewed as a distributed TARP or bus checker. It is also necessary to
provide fault tolerance within the interconnection network, e.g.,
trivially by replicating the network, or better by distributing the
network among the interconnected modules. In this latter approach
feedback can be used to verify that control is established correctly.

In conclusion, a multiprocessor structure like C.mmp or Plessey 250 can
be extended so as to achieve all of the prescribed design goals at a
comparatively low additional hardware cost. The addition of the
duplicated microprocessor and the extra cost of fault tolerance in the
interconnection network should be equivalent to about 20 percent of a
processor. There is also the additional overhead of executing the
reliability kernel in two processors, a cost that is presently unknown
but should be low.

CHAPTER 4. MEMORY ORGANIZATION


This  chapter  describes  the  use  of  redundancy  and  reconfiguration  in 
memory  to  increase  system  fault  tolerance.  Several  of  the  better  known 
schemes  are  treated,  and  a  new  approach  is  given  that  offers  great 
improvements  in  availability  for  large  memories. 

In most of the systems of interest here, there is a diversity of memory
types, from very fast small special-purpose memories (e.g., a cache, or
associative memory for paging, or a microprogram control store) to fast
main memories to various slower on-line memories (possibly block
oriented) to normally off-line storage. A virtual memory mechanism is
very helpful for the management of such a storage hierarchy, and can
contribute to economical fault tolerance in several ways. First, by
isolating real memory addresses from user programs, it contributes to
security, especially if the address manipulations are done reliably.
Second, it simplifies internal reconfiguration, replacement, and removal
via page relocation, increasing operational continuity in the presence
of faults. Third, it can provide a natural proliferation of different
versions of data and procedures that can be very helpful in recovery.


4.1. ERROR DETECTION AND ERROR CORRECTION IN MEMORY


The coding art is well developed with respect to realistic codes and
decoding procedures (e.g., Berlekamp 68, Peterson and Weldon 72). Thus
this section presents various conclusions based on this art, as well as
summarizing various aspects of byte coding for byte-organized memories.

As noted in Section 3.2, there is a wide range of criticality among
various memory usages. Simple single-error detection or byte-error
detection may suffice for much of memory. However, certain computations
for which rollback is both difficult and undesirable may require error
correction. Further, even where recovery is possible, some more
reliable memory may be required. Fortunately, coding in memory is
relatively cheap, even byte-error correction (see below), and especially
if used selectively.


Various special memories have special needs for coding techniques. For
example, error correction may be neither desirable (because of decoding
delay) nor necessary in an associative memory for which there is
write-through or easy restoration of faulty words. However, capability
for error detection may be very critical. For example, an error in the
associative memory of a paged system can drastically affect both the
system and its security. Burst coding (e.g., Elspas and Short 62,
Elspas et al. 62, Berlekamp 68, Peterson and Weldon 72) may be
effective in devices with serial transfer.

BYTE-ERROR  CODING 

Byte coding is highly appropriate for byte-per-chip memories, as in an
LSI chip storing b bits from each of y words (e.g., b=4, y=1024). Here
y n-bit memory words are stored in n/b chips. In some technologies it
is possible for a fault to result in as many as b bits in a chip being
in error, and thus byte detection or byte correction may be appropriate
for the b-bit bytes.

Detection of a byte in error within a word with k=n-r information digits
requires exactly r=b redundant bits, i.e., n=k+b, with b interlaced
parity checks. Almost complete byte-error detection is achieved with
the same redundancy using residue codes, which have the advantage that
they are also useful for detecting errors in arithmetic (see Avizienis
et al. 71, Parhami and Avizienis 73). Note that the same redundancy
(and in fact the same code with interlaced parity checks) also provides
BURST-ERROR DETECTION for burst errors, i.e., up to b errors confined to
b consecutive positions (cyclic or otherwise) (e.g., see Peterson and
Weldon 72). This is true even though b-bit byte errors are a subclass
of b-bit burst errors.
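
The interlaced parity scheme can be illustrated with a short sketch. It
is not taken from the report: the function names are hypothetical, bits
are held in Python lists, and k is assumed to be a multiple of b so that
check bit j falls in the same interlace as the information bits it
covers.

```python
def interlaced_checks(info_bits, b):
    """Return b parity bits; check j covers info positions j, j+b, j+2b, ..."""
    if len(info_bits) % b != 0:
        raise ValueError("this sketch assumes k is a multiple of b")
    checks = [0] * b
    for position, bit in enumerate(info_bits):
        checks[position % b] ^= bit
    return checks


def encode(info_bits, b):
    """Append the b interlaced checks, giving a word of n = k + b bits."""
    return list(info_bits) + interlaced_checks(info_bits, b)


def error_detected(word, b):
    """True if any of the b interlaced parities over the whole word is odd."""
    parities = [0] * b
    for position, bit in enumerate(word):
        parities[position % b] ^= bit
    return any(parities)
```

Any error confined to b consecutive positions touches each interlace at
most once, so at least one parity becomes odd. This holds whether or not
the b positions are aligned on a byte boundary, which is why the same
code detects bursts as well as byte errors.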

Byte correction can always be achieved with generalized base B Hamming
codes with $B = 2^b$. The redundancy (in bits) of these codes is

$$ r = b \left\lceil \log_{2^b}\!\left( \frac{n}{b}\,(2^b - 1) + 1 \right) \right\rceil \qquad (4.1) $$

as long as $r \ge 2b$; $\lceil x \rceil$ denotes the smallest integer
containing x. Fewer redundant bits actually suffice in many cases, with
a lower bound given by

$$ r \ge \left\lceil \log_2\!\left( \frac{n}{b}\,(2^b - 1) + 1 \right) \right\rceil \quad \text{and} \quad r \ge 2b . \qquad (4.2) $$

This follows from the required number of distinct error patterns (each
requiring a distinct SYNDROME, or check pattern) for each of the n/b
radix B digits. The best codes known are those of Hong and Patel (72):
if r is written as $r = ib + c$, with $0 \le c < b$, and i an integer,
then the value of k for a particular value of r is given by

$$ k = b\,\frac{(2^r - 1) - 2^b\,(2^c - 1)}{2^b - 1} + c - r . \qquad (4.3) $$

These codes are shown to be maximal when c is 0 or 1 (in the sense that
no such code with greater k can exist for that r); Hong and Patel
conjecture that this is true in general. The redundancy of these codes
is often identical to the bound in (4.2). Since b=1 corresponds to the
binary Hamming (single bit) error correcting codes, for which

$$ r = \left\lceil \log_2 (n + 1) \right\rceil , $$

byte correction requires roughly $b - \log_2 b$ bits more than (single)
bit correction. Note that b-bit (cyclic or non-cyclic) burst error
correction requires

$$ r \ge \left\lceil \log_2\!\left( n\,2^{\,b-1} + 1 \right) \right\rceil $$

bits of redundancy, which is typically at least $\log_2 b - 1$ bits more
than byte correction; cf. (4.2).


Table 4.1 summarizes the redundancy of the Hong-Patel codes for typical
values of k and small byte length b. Note that some byte-correcting
codes with b=2 have the same redundancy as the Hamming codes for b=1,
e.g., those with k from 28 to 36, for which r=6. The code with b=2 and
k=36 is perfect, i.e., every non-zero syndrome corresponds to a distinct
correctable byte error. (So are all of the Hong-Patel codes with c=0,
corresponding to generalized Hamming codes.) Also noteworthy is the
perfect code for b=4, k=60, r=8. Finally, as a simple illustration of
the gap between (4.2) and (4.3), consider the case of b=2 and k=15.
Here r=5 satisfies (4.2), but r=6 is required for this case. The values
of k shown are meant to be illustrative. If tag bits are included in
memory words (e.g., Feustel 73) and encoded, the actual value of k (as
opposed to its virtual value seen by data) may be quite unusual (e.g.,
51 as in the B6500).
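
The cases worked in this paragraph can be checked numerically. The
sketch below is an illustration only: the function names are
hypothetical, max_info_bits evaluates the Hong-Patel expression for k,
and the counting bound is tested in integer arithmetic by multiplying
both sides by b.

```python
def max_info_bits(r, b):
    """Largest k for r check bits and b-bit bytes (Hong-Patel expression)."""
    i, c = divmod(r, b)              # r = i*b + c with 0 <= c < b
    if i < 2:
        raise ValueError("the expression assumes r >= 2b")
    numerator = (2 ** r - 1) - 2 ** b * (2 ** c - 1)
    return b * numerator // (2 ** b - 1) + c - r


def counting_bound_ok(k, r, b):
    """Counting bound: (n/b)(2^b - 1) + 1 <= 2^r and r >= 2b, with n = k + r."""
    n = k + r
    return r >= 2 * b and n * (2 ** b - 1) + b <= b * 2 ** r


def is_perfect(k, r, b):
    """True when every nonzero syndrome is used by exactly one byte error."""
    n = k + r
    return n * (2 ** b - 1) + b == b * 2 ** r


def min_redundancy(k, b):
    """Smallest r >= 2b whose Hong-Patel code carries at least k bits."""
    r = 2 * b
    while max_info_bits(r, b) < k:
        r += 1
    return r
```

For b=2 and k=15, counting_bound_ok(15, 5, 2) holds while
min_redundancy(15, 2) is 6, which is the gap described above; both
(k, r, b) = (36, 6, 2) and (60, 8, 4) satisfy is_perfect.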


Table 4.1

SMALLEST POSSIBLE REDUNDANCY r FOR BYTE-ERROR
CORRECTION IN MEMORY WITH VARIOUS BYTE SIZES b

  Typical               Redundancy r for b =
  length k       1      2      3      4      5 .. 8

      16         5      6      6      8     10 .. 16
      24         5      6      7      8     10 .. 16
      32         6      6      7      8     10 .. 16
      48         6      7      8      8     10 .. 16
      64         7      7      8      9     10 .. 16
     128         8      8      9      9     10 .. 16

The cost of redundant storage for byte-error correction is thus seen to
be relatively small for b=2 and 4 (even more so if used selectively).

The cost in time delay can also be small. In fact, if automatic
instruction retry is available, the cost in time can be effectively
zero. This is possible for systematic codes (for which the information
digits are directly available in a correct word, as in the case of
Hamming codes), by overlapping the syndrome generation (i.e., error
detection and implicit location) with the instruction execution up until
(but not beyond) the point of no return for instruction retry. As long
as syndrome generation completes before that point is reached, there is
no delay at all due to decoding, assuming no errors. (This requires
pipelining the decoder in a pipelined machine.) If the word from memory
contains an error resulting in a nonzero syndrome, the instruction
execution is interrupted, the word is rewritten in memory (correctly,
after error correction), and the retry mechanism is triggered. Thus there
is a delay only when an error needs to be corrected.
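
A software model of this read path, for the simplest systematic case,
is sketched below. It assumes a (7,4) binary Hamming code (b=1) laid out
as four data bits followed by three check bits; the names COLUMNS,
encode, syndrome, and read_word are hypothetical, and the returned flag
merely stands in for triggering the hardware retry mechanism.

```python
# Columns of the parity check matrix, one per bit position
# (d0, d1, d2, d3, p0, p1, p2). All are distinct and nonzero, so the
# syndrome of a single-bit error identifies the failing position.
COLUMNS = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1),
           (1, 0, 0), (0, 1, 0), (0, 0, 1)]


def encode(data):
    """Systematic encoding: the data bits appear unchanged in the word."""
    d0, d1, d2, d3 = data
    return [d0, d1, d2, d3, d0 ^ d1 ^ d3, d0 ^ d2 ^ d3, d1 ^ d2 ^ d3]


def syndrome(word):
    """Three parity checks over the stored word; all zero means no error."""
    return tuple(
        sum(bit & column[row] for bit, column in zip(word, COLUMNS)) % 2
        for row in range(3)
    )


def read_word(memory, address):
    """Return (data, retry_needed) for one memory read."""
    word = memory[address]
    s = syndrome(word)
    if s == (0, 0, 0):
        # Data bits were usable as soon as they arrived: no decoding delay.
        return word[:4], False
    corrected = list(word)
    corrected[COLUMNS.index(s)] ^= 1     # syndrome locates the bad bit
    memory[address] = corrected          # rewrite the corrected word
    return corrected[:4], True           # caller restarts the instruction
```

In the error-free case the data bits are returned without waiting on
the decoder; only a nonzero syndrome costs a rewrite and a retry.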


Various efforts are devoted to designing fast decoders (e.g., Bossen 70,
Hong and Patel 72, Carter et al. 72b). Speed may also be enhanced by
the use of read-only memories in decoding (e.g., Laws 72, Mitarai and
McCluskey 72), both for the syndrome generation and for error
correction, as well as by performing various manipulations on the parity
check matrix.

The reliability of decoding for error correction may be enhanced by a
technique of Kautz (62), in which redundant syndromes are calculated,
providing a check on the syndrome generation itself. Such techniques
are economical, especially since no redundancy is added to memory, and
since the cost of the decoder(s) is small with respect to the cost of
memory. Distributing decoders among memory controllers, or even memory
modules, may have advantages of continuing availability of the system
despite malfunction of one decoder. Such distribution also facilitates
the selective use of coding, by permitting different encodings for
different portions of memory. However, it means that the busses are not
checked. See also Carter et al. (70b) for self-checking decoders.

4.2.  MEMORY  RECONFIGURATION 

In this subsection we consider schemes for reconfiguring a memory. The
memory is assumed to be built from a number of units (for example, LSI
chips), each having the same memory capacity. When a fault is detected,
at least one unit is discarded; the discarded units are either replaced
by a similar number of spare units, or the system continues with reduced
memory capacity. We use the terms as defined in Section 4.1, with the
following additions.

m  = total number of memory units (e.g., LSI chips).

x  = number of memory units in a block, i.e., the number discarded
     when a fault occurs.

y  = number of words per block.

w  = the total number of words in the memory.

w' = the number of usable words required (< w).


We are concerned with two measures of reliability. By P[≥w':w], we mean
the probability that, given w words originally, there are at least w'
words remaining at the time of consideration. The second measure used is
P[f], the probability that f faults can be tolerated. Schemes for
memory reconfiguration are assessed by the above two factors, plus a
measure of the cost of achieving the fault tolerance. We note further
that the probability P[f] of being able to tolerate f faults is
irrelevant for some memory structures. Consider, for example, a block
replacement scheme. All faults can be removed from the memory, although
with a reduction of memory capacity. The single measure P[≥w':w] is
therefore a sufficient measure of reliability for such a scheme. In
some other schemes, to be described below, the switching capability is
restricted and P[f] becomes a meaningful measure of the ability of this
switching network to remove faulty units.


Given x memory units in a block, and a probability p that a single unit
(chip) has failed, the probability $P_f$ of failure of a block is given
by

$$ P_f = 1 - (1 - p)^x \qquad (4.4) $$

The number of blocks is u = m/x = w/y. The probability $P_{fi}$ that i
blocks contain faults is given by

$$ P_{fi} = \binom{u}{i}\, P_f^{\,i}\, (1 - P_f)^{\,u-i} \qquad (4.5) $$

We use the notation $\lfloor a \rfloor$ to denote the largest integer
contained in a. The probability that at least w' words remain, given w
words originally, is given by

$$ P[\ge w' : w] = \sum_{i=0}^{\lfloor (w - w')/y \rfloor} P_{fi} \qquad (4.6) $$

4.2.1. MEMORY RECONFIGURATION BY BLOCK REPLACEMENT


Consider a memory in which reconfiguration is carried out by discarding
the block in which a fault exists. We further assume that the switch
network that carries out the reconfiguration can handle all fault
patterns, i.e., the ability to reconfigure is not constrained by the
switch but only by the availability of enough fault-free blocks. The
reliability of such a scheme is represented by (4.6) above.
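
Equation (4.6) is a binomial tail and is easy to evaluate directly. The
following sketch is illustrative only: the function names are
hypothetical, block failures are assumed independent, and w is assumed
to be an exact multiple of y.

```python
from math import comb


def block_failure_probability(p, x):
    """(4.4): a block of x units fails if any one of its units fails."""
    return 1.0 - (1.0 - p) ** x


def probability_enough_words(w_required, w, y, x, p):
    """(4.6): probability that at least w_required of w words survive."""
    u = w // y                               # number of blocks
    p_f = block_failure_probability(p, x)
    tolerable = (w - w_required) // y        # blocks that may be discarded
    return sum(
        comb(u, i) * p_f ** i * (1.0 - p_f) ** (u - i)   # (4.5)
        for i in range(tolerable + 1)
    )
```

For instance, with two one-word blocks of one unit each, p = 0.5, and
one usable word required, the function sums the probabilities of zero
and one failed block and returns 0.75.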

4.2.2. THE USE OF CODING WITH BLOCK REPLACEMENT


When coding is used for error detection and/or correction, as discussed
in Section 4.1, it becomes natural to constrain the parameters y and b.
The number of bits per chip yb is determined by the prevailing
technology (values from 256 to 4096 are currently common). With too low
a value of b, the number of data lines to the chip tends to make the
number of words in a block large, causing discarding of an excessive
number of words in the event of a fault. Too high a value of b has two
bad effects. First, it increases the number of pins required for data
on the chip. Second, if a code is used to detect and/or correct errors
on a chip, then the coding complexity rises. We therefore have the
possibility of tradeoff, which is analyzed in detail in Appendix 3.

Consider a memory constructed using LSI chips, in which coding is used
to correct errors, and blocks of memory are replaced immediately after a
fault occurs. The analysis in Appendix 3 assumes that a byte-error
correcting code is used. The number r of redundant bits is related to k
(the number of information bits), as discussed in Section 4.1.

Several detailed design topics are addressed in Appendix 3, particularly
analyses of the optimum value of b, and the value of P[≥w':w], given the
probability p of chip failure.

The  following  conclusions  are  relevant  here. 

* Block replacement strategies for long-life use (i.e., p = .1) require
very high redundancy to achieve useful system success probability.
Other fault-tolerance techniques should be used.

* For values of p < 0.01, the optimum value for b is 4 in almost all
cases.

* For mission times of the order of a month or less (i.e., p < .001),
very high values of P[≥w':w] can be achieved with less than 50%
redundancy.

* One advantage of block replacement is that the memory chips do not
need to contain special switching capabilities, as in some chip
replacement schemes. Another advantage is the simplicity of the
reconfiguration strategy. The disadvantage of block replacement is that
it is very inefficient in its use of spares, in that nonfaulty chips are
discarded because they are associated in the same block with a faulty
chip. We must therefore consider schemes in which the unit of
reconfiguration is smaller than the block.

4.2.3. RECONFIGURATION BY CHIP REPLACEMENT

A  typical  problem  in  a  chip-replacement  scheme  is  the  cost  of  the 
switching  network  required  to  replace  faulty  chips  with  spares.  The 
novel  scheme  presented  by  example  below,  and  in  general  in  Appendix  3, 
examines  the  possibility  of  economical  switching  for  reconfiguration  at 
the  chip  level.  The  primitive  element  in  the  memory  is  an  LSI  chip  that 
realizes  a  section  of  memory  b  bits  wide  by  y  words  long,  together  with 
an  address  decoder  for  the  y  words.  The  chips  (including  spares)  are 
connected  via  a  switching  network  so  that  the  memory  can  be  reconfigured 
effectively  in  the  presence  of  chip  failures.  The  main  results  relating 
to  the  switching  network  are  as  follows: 

* The extra cost of the switching network and the spare chips is low,
compared with a nonredundant memory system.

* There are well-defined tradeoffs among the cost of the switching
network, the number t of chip failures to be tolerated, the number s of
spare chips, and the complexity of setting up the switching network.

* The switching networks can be embedded within the memory chips, so as
to increase the reliability and increase uniformity.

A  TWO  DIMENSIONAL  SCHEME 

Consider first a non-reconfigurable LSI memory as shown in Figure 4.1.
Each LSI chip contains b bits of y words and a decoder for the low order
address bits, which are routed to all chips. The high order address
bits are decoded to provide activation of one control line, which
selects the row of chips that contain the desired word. Data is routed
to or from the chips via data lines shown vertically in Figure 4.1.

In the reconfigurable scheme to be described, the chips are augmented by
the incorporation of two switches as shown in Figure 4.2. One switch
enables the chip to be activated by one of three control lines from the
decoder or to be made inoperative by setting the switch to the null
position. Similarly the data switch can be set to be connected to either
of two data lines or to a null position. As in the non-reconfigurable
memory the chip contains a decoder for the low order address bits. The
chips are assembled into a memory structure as illustrated in Figure
4.3, which shows that the control lines are connected to three rows of
chips and each data line is connectable to two columns of chips. It is
assumed that wrap-around occurs both vertically and horizontally, i.e.,
the leftmost data line is also connected to the rightmost chip column,
and similarly for the top and bottom control lines. An extra column of
chips is provided that can be regarded as spares and which we will in
this discussion regard as being the rightmost column. Each spare chip
is controlled independently.

In discussing the reconfiguration capabilities of the scheme we
introduce a notation for lettering chips to indicate the setting of the
switches. The letter of the alphabet used indicates the control line to
which the chip is connected by the control switch. The use of upper or
lower case letters indicates that the chip is connected to the left or
right data line respectively. An unused chip is indicated by a hyphen
and a faulty chip is indicated by an asterisk.

Fig. 4.3  An example chip-reconfigurable memory.

Fig. 4.4  Reconfiguration examples.

Consider the reconfiguration examples shown in Figure 4.4. The normal
setting of the switches is such that the chips serve the data lines to
their right. In the third row we show the case of a single faulty chip
in the third column. The switches of the chips to the right of the
faulty chip are changed so that they serve the data lines to their left,
thereby enabling all data lines to be served. In this example the
reconfiguration was carried out within a single row, without having to
change the setting of the control line switches in the chips. More
extensive fault patterns must in general be handled by using spare chips
from adjacent rows. The three faulty chips of the fifth row are handled
by the following switch settings:

* Switch all good chips of row f that are on the right of the faulty
chip so that they serve the data line on their left, thus replacing one
of the faulty chips and leaving two vertical busses still to be served
by f-driven chips.

* Use two chips of the next row (labelled F) to replace the two places
in the f row that have not been handled and switch the g chips to their
right to serve their left busses. This leaves one vertical bus still to
be served by a g-driven chip.

* Use one chip from the next row (labelled G) to handle the remaining
chip position of the g row.

It  can  be  seen  that  a  fault  pattern  of  n  chips  in  a  row  can  be  handled 
within  n  rows  and  further  that  the  rows  below  it  or  above  may  be  used 
for the reconfiguration. In general, the pattern employed to accommodate
a fault is not uniquely determined. The pattern employed may be chosen
so as to better accommodate other possible nearby faults.
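
The single-row case described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the switch design of Appendix 3: it assumes a row of d+1 chips serving d data lines, with data line j lying between chip j-1 and chip j and the spare at the right-hand end. The function name row_pattern is hypothetical. The output uses the notation introduced for Figure 4.4 (lower case for a chip serving the line to its right, upper case for the line to its left, a hyphen for an unused chip, an asterisk for a faulty chip).

```python
def row_pattern(row_letter, data_lines, faulty=None):
    """Return (pattern, served) for one row of data_lines + 1 chips.

    Assumed geometry (not taken from the report's figures): chips sit at
    positions 0..data_lines, and data line j (1-based) lies between chip
    j-1 and chip j, so chip i can reach line i (left) or line i+1 (right).
    """
    pattern = []
    served = {}                                   # data line -> chip position
    for chip in range(data_lines + 1):
        if chip == faulty:
            pattern.append("*")                   # faulty chip, disconnected
        elif faulty is not None and chip > faulty:
            pattern.append(row_letter.upper())    # switched to the line on its left
            served[chip] = chip
        elif chip < data_lines:
            pattern.append(row_letter.lower())    # normal setting: line on its right
            served[chip + 1] = chip
        else:
            pattern.append("-")                   # spare chip, not needed
    return "".join(pattern), served


if __name__ == "__main__":
    print(row_pattern("c", 4)[0])                 # fault-free row
    print(row_pattern("c", 4, faulty=2)[0])       # single fault in the third column
```

Patterns with several faults in one row need chips borrowed from adjacent rows, as in the three-step procedure above; that case is not covered by this sketch.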

The third example of fault patterns, shown at the bottom of Figure 4.4,
illustrates a pattern that cannot be handled by this scheme, because
there  is  no  chip  that  can  serve  row  L  for  data  line  4.  This  is  the 
smallest  pattern  of  faults  that  cannot  be  handled  by  the  scheme.  There 
are  indeed  many  fault  patterns  containing  more  than  b  faults  that  can  be 
handled.  For  example  all  chips  in  two  adjacent  rows  can  be  faulty  and 
successful  reconfiguration  can  take  place  so  long  as  there  are  at  least 
twice  as  many  rows  as  columns  in  the  memory. 

Appendix  3  presents  detailed  aspects  of  the  memory  organization, 
including the use of coding to detect and correct errors, the setting up
of  the  switching  network,  and  the  relative  performance  of  this 
organization,  as  compared  with  block  replacement.  This  organization  is 
most  attractive  for  long-life  and/or  maintenance-free  applications. 

Beyond  the  number  of  chips  required  to  realize  a  given  memory  size, 
spare chips are provided to take over the function of failed chips. The
reconfiguration  is  achieved  with  a  switching  network  that  enables  the 
number  of  spare  chips  to  be  potentially  as  low  as  the  number  of  chip 
failures  to  be  tolerated.  As  demonstrated  in  Appendix  3,  the  cost  of 
the  switching  network  is  surprisingly  small.  Further,  the  switching 
network can be embedded economically within the memory chips. Thus,
since  typically  the  number  of  chips  in  the  nonredundant  memory  is 
comparatively  large,  the  redundancy  required  to  achieve  a  tolerance  to  a 
significant  number  of  faulty  chips  is  proportionally  quite  low. 

This  type  of  memory  organization  is  particularly  applicable  to  those 
situations  where  a  large  main  memory  is  required,  and  unattended 
operation  is  required  for  periods  so  long  that  many  faults  may  be 
experienced.  Appendix  3  discusses: 

*  The  memory  model  and  a  reliability  calculation  that  demonstrates  the 
applicability  of  the  organization. 

*  Types  of  switching  networks  that  can  realize  the  reconfiguration. 

*  A  regular  switching  network  organization  that  is  particularly 
attractive due to the ease of embedding the switching within each chip.

*  Reliability  estimates  of  the  above  regular  scheme. 

In conclusion, we have determined that the switch cost for reconfiguring
the  chips  of  a  memory  is  small  when  compared  with  the  total  memory  cost. 
We  have  also  shown  that  the  algorithms  for  deciding  which  switches  are 
to  be  set  can  be  simple  in  certain  cases. 

DISCUSSION  OF  SYSTEM  ASPECTS 

The key aspect of the chip replacement organization is the switching
network  that  effects  the  reconfiguration.  Appendix  3  gives  realizations 
of  such  networks  wherein  the  switch  cost  per  memory  chip  is  quite 
nominal,  and  whereby  the  switching  can  be  embedded  entirely  within  the 
memory  chips.  It  is  expected  that  this  organization  will  find  utility 
in  applications  with  varying  requirements  as  to  long  life,  large  memory, 
low  or  nonexistent  maintenance,  and  low  spare  redundancy.  For  modest 
requirements for which only one or two faults need to be tolerated
between  maintenance  operations,  conventional  approaches  such  as  simple 
memory  block  replacement  probably  suffice.  In  addition,  the  use  of 
low-distance  error  correcting  and  detecting  codes  may  be  desirable 
whenever  rollback  strategies  are  either  not  permitted  or  not  feasible. 

A  few  theoretical  problems  remain,  the  solution  of  which  might  lead  to 
more  efficient  use  of  this  organization: 

*  Deriving  the  minimal  switch  complexity  required  as  a  function  of  the 
memory  size,  number  of  spares,  arid  number  of  faults  to  be  tolerated. 

*  Deriving  optimal  algorithms  for  deciding  on  the  appropriate  settings 
of  switches.  It  would  be  desirable  to  determine  tradeoffs  between  the 
switching  network  complexity  and  the  set-up  algorithms. 

Perhaps of greater practical interest are the overall system aspects of
including such a memory organization within a fault-tolerant system. We
consider these issues below, with some indication of the difficulty that
each aspect introduces.

FAULT  DETECTION.  Conventional  error  detecting  and  correcting  coding 
techniques  can  be  superposed  on  the  reconfiguration.  That  is,  the 
overall  bit  length  n  can  include  code  redundancy.  A  decoder  then  checks 
each  word  on  read-out  from  memory,  in  which  case  an  immediate  indication 
is  available  of  the  block,  and  possibly  the  byte,  in  error.  The 
reconfiguration process can then remove the offending chip and produce a
new operative block of memory. The byte-error correcting codes of
Section 4.1 can be used here. Also in this organization some crucial
sections of memory can be given more protection by reconfiguring certain
blocks  to  have  more  code  redundancy  than  others. 

SWITCHING  NETWORK  FAILURES.  Many  switch  failures  merely  disable  the  chip 
itself and thus can be handled the same way as chip failures. Two
exceptions  are  switch  failures  that  produce  a  solid  signal  on  a  data 
line  or  that  prevent  a  chip  from  being  disconnected  from  a  given  control 
line.  Such  failures  require  the  introduction  of  redundant  data  lines. 
Coding  techniques  as  described  above  can  correct  for  these  switch 
failures.  Also  a  spare  data  line  can  be  provided,  at  slight  extra  cost 
in  switching  complexity.  The  spare  line  would  be  activated  in  place  of 
a  failed  line,  in  which  case  the  network  block  that  receives  the  memory 
data  (usually  the  memory  data  register)  extracts  the  d  good  data  lines 
from  the  d+1  lines  directed  to  it. 

ADDRESS  DECODER  FAILURES.  The  memory  organization  is  clearly  sensitive 
to  failures  in  the  decoder  that  drives  the  control  lines.  It  is  likely 
that  some  of  the  decoding  function  can  be  distributed  among  the  chips, 
up  to  the  availability  of  pins.  However,  for  a  large  memory  system, 
most  of  the  decoder  will  remain  external  to  the  chips.  Since  the 
decoder  consumes  perhaps  three  to  four  orders  of  magnitude  fewer  parts 
than  the  rest  of  memory,  various  fault  tolerance  techniques  can  be 
economically  applied. 

SWITCH  SET-UP.  With  one  extra  control  line  per  block  the  switches  for 
the chips can be set by applying appropriate signals to the data lines
and  the  control  lines.  In  this  mode  the  switches  are  set  one  at  a  time, 
a  time  penalty  that  does  not  appear  to  be  excessive. 


SPAN  OF  RECONFIGURATION.  When  the  memory  is  reconfigured  subsequent  to  a 
failure,  a  large  portion  of  the  memory  may  have  to  be  reconfigured, 
including  operative  blocks.  This  contrasts  with  a  block  replacement 
scheme  for  which  only  the  affected  block  need  be  reconfigured.  We  have 
not  computed  bounds  on  the  number  of  blocks  that  must  be  reconfigured  in 
the  organization  considered.  However,  in  many  systems  (e.g.,  in  a  paged 
environment)  it  is  possible  to  dump  the  contents  of  the  z-1  operative 
blocks  onto  a  backup,  in  which  case  the  span  of  the  reconfiguration  is 
not  a  problem.  This  approach  is  not  feasible  in  a  real-time  environment 
where  long  down-time  (e.g.,  more  than  10  msec.)  is  unacceptable. 

In  conclusion  we  feel  that  there  are  no  insurmountable  problems  in 
incorporating  this  memory  organization  into  a  system.  The  cost  is  small 
in  a  large  memory  system,  and  may  be  justified  by  the  prevalence  of 
memory faults in such a system. The switching techniques employed in
this organization are also generally applicable to homogeneous processor
arrays.


4.2.4  RELIABLE  SWITCHING  CAPABILITY 

Previous  sections  discuss  how  the  memory  function  can  be  reconfigured 
either  at  the  block  or  chip  level.  It  remains  to  be  shown  that  a 
switching  scheme  can  be  designed  that  can  be  fault-tolerant. 

We  assume  the  following: 

*  Memory  blocks  containing  y  words,  each  containing  its  own  address 
decoder logic. Typically y will vary from 0.5K to 4K.

*  Control  units  which  control  access  to  the  memory  blocks.  Each  control 
unit is connected to c1 memory blocks and each memory block is
connected to c2 control units. Some regularity is assumed in the
connections. (When c1 = c2, we use the single parameter c to represent
both.  Under  these  circumstances  the  number  of  memory  blocks  equals  the 
number  of  control  units,  and  the  two  units  could  be  constructed  as  a 
single  module.  Possible  structures  for  c  =  3  and  4  are  shown  in  Figures 
4.5  and  4.6.) 

*  A  data  bus  structure  which  connects  to  all  memory  blocks. 

*  A  block  address  structure  which  connects  to  all  control  units. 

*  A  page  address  structure  that  connects  to  all  memory  blocks. 

*  Control  logic. 

In  all  regular  arrays  of  control  units  and  memory  blocks,  it  is  assumed 
that  all  edge  connections  are  "wrapped  around"  (i.e.,  that  linear 
structures  are  mapped  onto  a  ring,  two-dimensional  structures  are  mapped 
onto  a  toroid,  and  so  on). 

The  mode  of  operation  is  explained  in  terms  of  a  'READ'  from  memory. 

Each  control  unit  contains  registers  (Block  Address  Registers,  BAR) 
whose contents are the block addresses of the memory blocks to which it
is  connected.  The  block  address  of  the  required  word  is  transmitted  to 
all  control  units,  where  a  comparison  is  made  with  BARs,  and  if  a  match 
is  found,  an  enable  signal  is  transmitted  to  the  relevant  memory  block. 
Under  no  fault  conditions  a  selected  memory  block  will  receive  c  enable 
signals.  The  page  address  (i.e.  the  low-order  bits  of  the  address)  is 
transmitted  to  all  memory  blocks.  The  selected  memory  block  reads  the 
selected  word  and  places  the  word  on  the  data  bus.  The  operation  is  now 
complete. 

It  is  assumed  that  each  memory  block  is  tolerant  to  a  number  of  faults 
(e.g.,  one)  but  that  a  larger  number  of  faults  will  cause  it  to  be 
inoperative.  The  primary  purpose  of  the  control  unit  is  to  allow 
reconfiguration  of  the  memory.  This  reconfiguration  is  achieved  by 
changing the contents of the BARs in the control units which are
connected to the memory blocks whose addresses must be changed. A fault
in  a  control  unit  could  result  in  an  error  in  the  'enable'  signal  sent 
to  a  memory  block.  To  prevent  such  a  fault  from  causing  errors,  a 
voter  is  used  in  the  memory  block.  Note  that  the  voter  examines  only 
the  enable  line.  The  connection  of  a  memory  block  to  the  data  bus 
structure  can  also  be  controlled  by  the  multiple  enable  signals,  thereby 
preventing  a  faulty  memory  block  from  erroneously  seizing  the  bus 
structure. 
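
The READ sequence, the voter on the enable lines, and reconfiguration through the BARs can be illustrated with a small simulation. The following is a minimal sketch under assumptions that are not taken from the report: a ring of control units with c = 3, unit i wired to blocks i-1, i, i+1 (wrapped around), a simple majority voter, and a "stuck enable" fault model. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ControlUnit:
    # bars maps a physical memory-block index to the block address held in
    # the Block Address Register (BAR) for that connection.
    bars: dict = field(default_factory=dict)
    stuck_enable: bool = False      # injected fault: asserts every enable line

    def enables(self, block_address):
        """Physical blocks to which this unit sends an enable signal."""
        if self.stuck_enable:
            return set(self.bars)
        return {blk for blk, addr in self.bars.items() if addr == block_address}


def build_ring(n_blocks):
    """Assumed c = 3 ring: unit i is wired to blocks i-1, i, i+1 (mod n)."""
    units = []
    for i in range(n_blocks):
        wired = [(i + k) % n_blocks for k in (-1, 0, 1)]
        units.append(ControlUnit({blk: blk for blk in wired}))
    return units


def selected_blocks(units, block_address, c=3):
    """Each block votes on its c enable lines; a majority selects it."""
    votes = {}
    for unit in units:
        for blk in unit.enables(block_address):
            votes[blk] = votes.get(blk, 0) + 1
    return sorted(blk for blk, n in votes.items() if n > c // 2)


def reassign(units, physical_block, new_address):
    """Reconfigure by rewriting the BARs of every unit wired to the block."""
    for unit in units:
        if physical_block in unit.bars:
            unit.bars[physical_block] = new_address


def read(memory, units, address, words_per_block):
    """Split the address, select a block by voted enables, return the word."""
    block_address, page_address = divmod(address, words_per_block)
    chosen = selected_blocks(units, block_address)
    if len(chosen) != 1:
        raise RuntimeError("no unique block selected")
    return memory[chosen[0]][page_address]
```

In this model a single control unit with a stuck enable line contributes only one of the three votes seen by any block, so it cannot select a block on its own; retiring a block and giving its address to another block is done purely by rewriting BAR contents.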

Table  4.2  summarizes  the  fault  tolerance  of  an  example  of  the  type  of 
memory  system  described  above.  We  assume  that  LSI  chips  will  be  used 
with  4K  bits/chip.  We  consider  an  example  with  32  bit  words  and  256k 
words,  and  we  ignore  the  cost  of  the  error  correcting  encoder/decoder 
circuits.  We  are  concerned  with  two  measures,  first,  the  probability 
that  a  particular  failure  mode  will  occur,  and  second,  what  the  effect 
of  that  failure  will  be. 

REDUNDANCY.

*  Unprotected memory = 2048 chips

*  Memory protected by byte codes = 2560 chips (20% redundancy)

*  Memory protected by byte codes plus reconfiguration = 2624 chips
(23% redundancy)

HARDCORE. The structure as outlined contains address propagation
circuits.  As  shown,  these  circuits  do  not  possess  any  reconfiguration 
capability.  In  this  sense  they  represent  the  "hardcore"  of  the  system, 


Table 4.2  (w = 4K, Total Storage = 256K, c = 4, Word = 32 bits)

Failure Mode                          Probability/Hr           External Effect

Single chip failure in a particular   4 x 10^-5
memory block

Single chip failure in any memory     2.6 x 10^[illegible]
block

Two chip failures in same block       3 x 10^-11 x T           Loss of block
before reconfiguration                (T = time (in secs)      plus loss of data
                                      to reconfigure)

Control unit fault                    6.4 x 10^-5

Two adjacent control unit faults      5 x 10^[illegible]       Possible loss of
(adjacent means two control units                              block; possible
which are connected to a common                                loss of total
memory block)                                                  memory

VIRTUAL  MEMORY.  The  control  units  map  virtual  block  addresses  to 
physical  block  addresses.  These  units  can  therefore  be  used,  with  no 
increase  in  cost,  to  implement  virtual  addressing  and  paging,  without 
need  for  any  other  "paging  box"  or  its  equivalent. 


FAULT-TOLERANT  DATA  LINE  STRUCTURES 

A  reliable  memory  system  can  be  built  in  which  the  memory  chips 
themselves  can  be  reconfigured  if  a  fault  occurs.  A  further  potential 
cause  of  failure  of  the  memory  is  the  failure  of  the  data  lines  both 
into  and  from  the  chips.  Such  failures  would  tend  to  be  less  frequent 
because  the  amount  of  equipment  involved  is  much  less  than  in  the  memory 
function  itself.  Thus,  for  some  applications  it  is  not  necessary  to 
protect  against  the  failure  of  these  lines,  while  in  more  stringent 
applications  a  means  must  be  provided  to  carry  out  some  protection.  The 
data  lines  may  fail  in  two  ways.  First,  the  equipment  in  those  lines 
may  itself  become  faulty.  Second,  a  failure  of  one  or  more  of  the 
memory chips may cause a data line to be subjected to erroneous signals.

We  can  consider  two  prime  ways  in  which  the  data  lines  can  be  protected 
against  faults,  by  coding  or  by  the  use  of  redundant  lines.  In  both 
cases equipment must be added, to carry out the encoding and decoding,
or  to  switch  the  redundant  lines.  The  probability  of  faults  in  this 
additional  equipment  may  be  greater  than  in  the  data  lines  that  are  to 
be  protected,  and  careful  analysis  must  be  carried  out  to  determine  if 
such  equipment  is  therefore  justified. 

CODING ON THE DATA LINES. The use of a code for single-error correction
and  double-error  detection  on  the  lines  protects  against  any  single  data 
line  presenting  spurious  data.  Such  a  code  is  quite  economical  for  all 
reasonable  word  lengths. 
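
To give a feel for the economy claimed here, the number of check lines needed by a standard extended Hamming code (single-error correcting, double-error detecting) can be computed directly. This is a sketch assuming that particular code family; the report does not say which SEC-DED code is intended, and the function name is hypothetical.

```python
def sec_ded_check_bits(data_bits):
    """Check bits of an extended Hamming (SEC-DED) code for data_bits."""
    r = 1
    # Single-error correction needs 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One overall parity bit adds double-error detection.
    return r + 1


if __name__ == "__main__":
    for k in (8, 16, 32, 64):
        print(k, sec_ded_check_bits(k))
```

For a 32-bit word this gives 7 extra lines, and doubling the word length adds only one more.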

REDUNDANT  DATA  LINES.  The  addition  of  a  single  data  line  can  easily  be 
incorporated  into  some  memory  schemes  such  as  the  chip  replacement 
scheme  discussed  above. 

CHAPTER 5. ARITHMETIC AND LOGIC


While  very  low  redundancy  in  memory  produces  significant  improvements  in 
system  fault  tolerance,  arbitrary  logic  may  require  full  duplication  for 
instantaneous  error  detection  in  any  one  unit,  and  full  triplication  for 
instantaneous error correction in any one unit. Fortunately, there are
several  factors  that  may  help  to  reduce  the  cost  of  redundancy: 

(a)  Detection  may  not  be  uniformly  critical  in  time  and  space.  For 
example,  partial  detection  may  suffice,  detecting  only  certain  faults, 
or  detecting  a  fault  within  some  period  of  time.  Similarly,  some  faults 
may  be  more  critical  than  others.  Also,  within  a  particular  scope  of 
computation  (e.g.,  an  instruction,  a  subroutine,  a  block,  a  domain 
within  a  process,  or  a  process),  detection  may  be  required  only  on  exit. 

(b)  Instantaneous  correction  may  be  unnecessary,  especially  when  good 
facilities  are  available  for  recovery  and  retry  (with  or  without 
diagnosis).

(c)  Considerable  flexibility  arises  in  the  use  of  reconfiguration  of 
units  (e.g.,  through  changeable  microcode),  with  tradeoffs  among  degrees 
of  redundancy,  performance,  and  functional  completeness. 

(d)  Many  systems  seem  to  be  dominated  by  the  costs  of  memory.  Thus, 
greater  relative  redundancy  in  arithmetic,  logic,  and  control  may  have 
little  impact  on  the  overall  cost  of  the  system. 

(e)  Automatic  retry  of  an  instruction  during  which  an  error  has  been 
detected  is  both  powerful  and  economical.  Its  primary  requirement  is 
that  the  initial  operands  (e.g.,  in  registers  or  memory  locations) 
should  not  be  overwritten  during  instruction  execution  —  or  at  least 
should  be  recoverable  from  somewhere  in  memory. 

These factors are found to some extent in existing systems, but usually
in isolation rather than as part of a systematic methodology for fault
tolerance.

5.1. DETECTION AND CORRECTION OF ERRORS IN ARITHMETIC


Arithmetic operations may be checked by duplication and comparison, with
hardware redundancy of about 55%. Duplication detects all errors in any
one unit, but fails to detect identical errors in each of the two units.
The use of dual-rail logic is also possible, with hardware redundancy of
about 37% and 42% cited in an arithmetic-logic unit for 64-bit and
32-bit words, respectively (Carter et al. 70). However, the class of
faults covered is significantly less. The use of residue codes (e.g.,
Avizienis 71) can be effective, with redundancy in the range between 10%
and 25%. A residue code has the advantage that it is also error
detecting if used in memory. For example, in a byte-organized memory
with b-bit bytes, the use of the residue 2^b - 1 detects all errors in a
byte except for the error which substitutes the all-zero byte for the
all-one byte, or vice versa. The cost in memory is one redundant byte.
(This cost is the same for complete byte-error detection in memory,
using b interlaced parity checks, which however do not detect errors
in arithmetic.) For bit-serial and byte-serial arithmetic, duplication
is both cheap and effective. For byte-serial arithmetic, residue codes
are also of value (e.g., Avizienis et al. 71). A would-be problem of
multiple errors resulting from a single fault can be overcome by the use
of the (2^b - 1)'s complement of the residue. (See Avizienis 71 for the
use of inverse residue codes for repeated-use faults.) For parallel
arithmetic, residue codes may offer substantial cost advantages over
duplication, although care must be taken in carry-look-ahead schemes to
avoid unchecked multiple errors resulting from a single fault (cf.
Langdon and Tang 70); otherwise duplication may again be preferable.
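
The two properties of the residue 2^b - 1 used above (it checks addition, and it misses only the all-zero/all-one byte substitution) can be verified by brute force for small b. The following Python sketch is illustrative only; the function names are hypothetical and the check is exhaustive rather than a proof.

```python
def residue(value, b):
    """Residue of value modulo 2**b - 1."""
    return value % (2 ** b - 1)


def add_is_consistent(x, y, b):
    """True if the residue of the sum equals the sum of the residues."""
    m = 2 ** b - 1
    return residue(x + y, b) == (residue(x, b) + residue(y, b)) % m


def undetected_byte_substitutions(b, n_bytes):
    """Exhaustively list single-byte substitutions (old, new) that leave
    the residue unchanged.

    Because 2**b is congruent to 1 modulo 2**b - 1, replacing byte value
    old by new in any byte position changes the residue by (new - old).
    """
    m = 2 ** b - 1
    missed = set()
    for position in range(n_bytes):
        weight = 2 ** (b * position)
        for old in range(2 ** b):
            for new in range(2 ** b):
                if old != new and ((new - old) * weight) % m == 0:
                    missed.add((old, new))
    return missed


if __name__ == "__main__":
    # For 4-bit bytes (residue 15) only the 0 <-> 15 substitutions escape.
    print(sorted(undetected_byte_substitutions(4, 8)))
```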

Byte-organized  processing  is  advantageous  for  integrated  circuit 
implementations, and is also well suited to carry-look-ahead schemes.

In a byte-organized arithmetic unit with bytes of length b, multiple
errors may arise in a single byte slice (e.g., on a single chip). These
are detectable by residue codes with a residue at least 2^b and
relatively prime to 2. If the all-zero/all-one substitutions are of
negligible likelihood, the residue 2^b - 1 is ideal. (If they are
likely, then alternating the physical encoding for a "1" in successive
bit positions may be useful.) Note that duplication provides BYTE-ERROR
LOCATION,  since  the  error  is  in  the  lowest-order  byte  position  in  which 
a  discrepancy  exists.  However  this  could  result  from  a  fault  in  either 
of  the  two  units,  and  then  either  in  the  byte  or  in  the  carry  into  the 
erroneous  byte,  so  that  duplication  is  not  FAULT  LOCATING.  Byte-error 
locating  arithmetic  codes  that  are  not  also  byte-error  correcting  (see 
below)  do  not  otherwise  seem  to  exist  (Neumann  and  Rao  73). 
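
The byte-error location property of duplication amounts to finding the lowest-order byte in which the two results differ. A minimal sketch follows (the function name is hypothetical); as the text notes, the position found identifies the erroneous byte but not which unit, or which carry, was at fault.

```python
def first_discrepant_byte(result_a, result_b, b):
    """Index (0 = lowest order) of the lowest-order b-bit byte in which
    two duplicated results differ, or None if they agree."""
    diff = result_a ^ result_b
    if diff == 0:
        return None
    lowest_set_bit = (diff & -diff).bit_length() - 1   # position of lowest differing bit
    return lowest_set_bit // b
```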

If  correction  of  arithmetic  errors  is  required,  triplication  is  clearly 
one alternative. There is also an extensive theory of error-correcting
arithmetic  codes.  Such  codes  typically  require  a  cost  roughly 
equivalent  to  duplication  of  the  arithmetic  unit  (Rao  70),  instead  of 
triplication  (plus  voting).  These  codes  may  also  be  used  for  error 
detection,  detecting  a  wide  range  of  multiple  errors  at  much  lower 
decoding  cost.  For  byte-organized  arithmetic  units,  the  recent  work  of 
Neumann  and  Rao  is  applicable,  providing  codes  for  byte-error  correction 
in arithmetic. See Appendix 4 for an extended version of Neumann and
Rao  73.  (A  notation  gap  exists  between  the  literature  on  memory  coding 
and  that  on  arithmetic  coding,  which  has  regretfully  been  perpetuated.) 


The suitability of such byte-correcting arithmetic codes is not
uniformly clear. It depends on the particular byte sizes and word
lengths,  and  on  the  type  of  decoding.  The  redundancies  required  for 
various  codes  are  compared  in  Table  III  of  Appendix  4.  Included  are  the 
minimum  redundancy  byte-correcting  codes  for  memory  (Hong  and  Patel  72, 
see Table 4.1), denoted by "M" in the table; the AN and gAN codes with
A = (2^b - 1)p (denoted by "A"); bi-residue codes with arbitrary residues
2^b - 1 and p (denoted by "R"); multi-residue codes with generalized "low-cost"
residues of the form of expression (11) of Appendix 4 (denoted by "G");
and those multi-residue codes with only low-cost residues, of the form
2^b - 1 and 2^c - 1 (denoted by "L").

The  byte-correcting  arithmetic  codes  also  provide  byte-error  correction 
when  used  in  memory.  Some  of  these  codes  have  redundancy  very  close  to 
the  comparable  byte-error  correcting  codes  for  memory.  Such  codes  thus 
have  potential  for  efficient  dual  use,  both  in  memory  and  in  arithmetic. 
Advantages of such dual use are discussed by Avizienis et al. (71),
particularly with respect to the residue 15 error-detecting code used in
the JPL-STAR (Avizienis A2). Other codes require substantially more
redundancy,  in  which  case  they  are  not  appropriate  for  such  dual  use. 
Nevertheless  they  remain  of  interest  for  byte-organized  arithmetic. 

As a favorable example, consider the length k = 42, with 2-bit bytes.
Here 7 bits of redundancy are required for byte correction in memory,
while 8 bits are needed for several forms of arithmetic byte correction
(A, R, G). In particular, the radix 4 byte-correcting multi-residue
arithmetic code with low-cost residues 3 and 49 = 7x7 has the remarkable
property that byte-error detection in arithmetic and memory is obtained
simply by taking residues modulo 7. Thus byte-error detection alone is
cheap and fast, with correction available if desired. Other examples
are cited in Appendix 4.
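
The arithmetic fact behind modulo-7 detection can be checked directly. The sketch below does not reproduce the code of Appendix 4; it only assumes that a single-byte error in a radix-4 word changes the value by e x 4^j, with e between 1 and 3 in magnitude, and asks which error magnitudes a given modulus fails to see. The function name and the choice of 21 byte positions (42 bits) are illustrative assumptions.

```python
def undetected_error_magnitudes(modulus, byte_bits=2, n_bytes=21):
    """Signed byte-error values e for which e * radix**j is divisible by
    modulus in some byte position j, and which a residue check modulo
    that modulus therefore cannot detect."""
    radix = 2 ** byte_bits
    missed = set()
    for position in range(n_bytes):
        for e in range(1, radix):
            for sign in (1, -1):
                if (sign * e * radix ** position) % modulus == 0:
                    missed.add(sign * e)
    return missed


if __name__ == "__main__":
    print(undetected_error_magnitudes(7))   # nothing escapes modulo 7
    print(undetected_error_magnitudes(3))   # only the all-zero/all-one swap escapes modulo 3
```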

There  is  also  recent  work  on  burst-error  correcting  arithmetic  codes 
(e.g.,  Bow  73),  although  that  is  probably  of  less  interest  here. 

In  general,  arithmetic-error  detection  is  highly  advantageous.  Error 
correction  may  be  needed  only  rarely,  especially  if  instruction  retry  is 
possible  in  the  case  of  intermittent  faults,  or  if  alternate  means  are 
available in the case of permanent faults. Such alternate means may
include,  for  example: 

(a)  Switching  a  spare  byte  slice  to  replace  a  faulty  one,  e.g.,  using 
the rippler of Stiffler (73). (An extreme example is that of a cyclic
loop  of  n+1  stages;  when  broken  by  a  faulty  stage,  there  are  still  n 
consecutive  correct  stages.  However,  there  are  problems  here  in 
switching  on  read-in  and  read-out.) 

(b) Removing the faulty byte slice, with either a degradation in
precision, or the use of multiple-precision operations (possibly in
microcode).

(c)  In  a  duplicate-unit  environment,  discarding  the  faulty  duplicated 
unit, leaving full computational capacity, but no checking capability.

Whenever  the  cost  of  arithmetic  units  is  typically  small  compared  with 
memory  costs,  the  reliability  and  availability  goals  can  freely 
influence  the  design.  Nevertheless,  the  cost  of  logical  (functional) 
duplication  need  not  be  physical  duplication.  For  example,  a  fast 
parallel  arithmetic  unit  may  be  checked  by  a  slower  byte-serial  unit, 
with  disagreement  triggering  an  instruction  interrupt.  In  some  cases 
(e.g., if the result is being written into a memory much slower than the
arithmetic  unit),  simple  instruction  retry  may  suffice.  In  some  cases 
(e.g.,  in  a  pipelined  environment),  some  rollback  may  be  required, 
although  this  can  be  minimized  by  judicious  use  of  registers  and  memory. 

In  summary,  high  availability  results  from  a  multiplicity  of  units,  or 
multiple-precision  modes  among  degraded-performance  units  with  removed 
byte-slices.  High  reliability  results  from  the  use  of  error  detection 
with  retry,  rollback,  and  reconfiguration,  and  with  error  correction 
possible  in  extreme  cases.  Probabilistic  detection  may  be  adequate. 
Periodic  interspersed  diagnosis  provides  a  useful  enhancement  when 
detection  is  not  available  directly. 

5.2.  ERROR  DETECTION  IN  LOGIC  OPERATIONS 

For  logic  operations,  duplication  is  necessary  for  error  detection  in 
some cases, while coding does not work, except for modulo-two linear
operations (e.g., exclusive OR). Dual-rail logic (Carter et al. 72)
seems  valuable,  with  costs  potentially  less  than  duplication  for  error 
checking.  In  some  cases,  consistency  checks  are  available.  In  other 
cases,  partial  detection  is  acceptable,  at  relatively  low  cost  (cf. 
Carter  et  al.  71a).  In  such  cases,  detection  is  not  immediate,  but 
occurs  in  a  probabilistic  sense  within  a  specified  period  of  time.  As 
in  the  case  of  arithmetic,  where  alternate  means  are  available  for 
permanent  faults,  various  alternatives  are  available  for  logic.  These 
include  using  spare  byte-slices,  performing  (possibly  micro-programmed) 
two-step  operations  on  a  half  unit,  and  (in  a  duplicated  mode) 
discarding a faulty duplicate unit to run simplex.


Still  another  alternative  is  available  for  logic,  using  the  arithmetic 
unit to perform logic operations (e.g., micro-programmed) when a logic
unit  is  not  available.  If  the  arithmetic  unit  is  checked,  it  follows 
that  the  logic  operations  are  also  checked,  as  seen  below.  It  is  well 
known that all logic operations may be derived arithmetically, given for
example, the bit-wise AND operation x ∧ y, e.g.:

x ∨ y = (x + y) - (x ∧ y),                    (5.1)

x ⊕ y = (x + y) - 2 * (x ∧ y).

Here "∨" and "⊕" denote INCLUSIVE OR and EXCLUSIVE OR, respectively.
The remaining operations are normal arithmetic addition, subtraction,
and multiplication ("+", "-", and "*", respectively). Complementation is
easily obtained when ONE's or TWO's complement representations are used.
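
The identities in (5.1) are easy to confirm on ordinary integers, where Python's "&" plays the role of the bit-wise AND. This is a minimal sketch with hypothetical function names; it checks the identities and says nothing about how a residue-checked unit would sequence the operations.

```python
def inclusive_or(x, y):
    """x OR y formed from addition, subtraction, and bit-wise AND."""
    return (x + y) - (x & y)


def exclusive_or(x, y):
    """x XOR y formed from addition, subtraction, doubling, and bit-wise AND."""
    return (x + y) - 2 * (x & y)


if __name__ == "__main__":
    # Exhaustive check over all pairs of 8-bit operands.
    ok = all(
        inclusive_or(x, y) == (x | y) and exclusive_or(x, y) == (x ^ y)
        for x in range(256)
        for y in range(256)
    )
    print(ok)
```

Note that the intermediate sum x + y may need one more bit than the operands, which is the temporary overflow discussed later in this section.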

Monteiro  and  Rao  (72)  have  examined  a  realization  of  logic  operations 
using a residue-checked arithmetic unit and an AND circuit to produce
checked arithmetic and logic operations. If logic operations are
relatively infrequent, little performance degradation is required to
perform checked logic in arithmetic. Since the AND operation x ∧ y can
be available as a byproduct of the arithmetic unit, e.g., when the sum
is obtained as

z = x + y = (x ⊕ y) + 2 * (x ∧ y),            (5.2)

it is possible to generate all logic operations without the extra AND of
Monteiro and Rao, although it is of course desirable to augment the
byproduct AND output with the correct residue check digits. For various
implementations of (5.2), the incorrectness of x ∧ y results in the sum
z = x + y being in error. If the error is detectable (e.g., via the
residue check on the result), retry and reconfiguration may be
initiated, as warranted. Similarly, an error in arithmetic during the
formation of a logic operation other than "∧" may be detected by the
arithmetic checks on the successive arithmetic operations. However,
arithmetic overflow (one bit) must be covered by the residue code. An
overflow  may  arise  temporarily  during  the  sequence  of  operations  (e.g., 
in  (5.1)),  but  disappears  in  the  final  result. 

This  approach  is  extendable  to  byte-error  detection  and  correction. 
However, in cases of multiple faults, it is necessary to assure that
x ∧ y is correct independently of the correctness of x + y. For example, a
pair of cancelling (but rare) errors would not be detected by the
residue check on x + y, e.g., +1 in position i+1 of x ⊕ y, and -1 in
position i of x ∧ y in (5.2).
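
The cancelling pair of errors can be made concrete. In the sketch below (hypothetical function names, integers standing in for the internal XOR and AND terms of (5.2)), an error of +1 in position i+1 of the XOR term and -1 in position i of the AND term leaves the sum unchanged, so any residue check on the sum passes even though the AND byproduct is wrong.

```python
def sum_from_parts(xor_part, and_part):
    """Form z = (x XOR y) + 2 * (x AND y), as in (5.2)."""
    return xor_part + 2 * and_part


def cancelling_error_demo(x, y, i, modulus):
    """Return (good sum, faulty sum, faulty AND term, residue check passes)."""
    xor_part, and_part = x ^ y, x & y
    good = sum_from_parts(xor_part, and_part)
    # Inject +1 in position i+1 of the XOR term and -1 in position i of the AND term.
    bad_xor = xor_part + (1 << (i + 1))
    bad_and = and_part - (1 << i)
    bad = sum_from_parts(bad_xor, bad_and)
    return good, bad, bad_and, good % modulus == bad % modulus


if __name__ == "__main__":
    print(cancelling_error_demo(0b0110, 0b0100, 2, 15))
```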

A  final  word  is  appropriate  on  the  impact  of  technology  on  the  relevance 
of  the  schemes  discussed  here.  On  one  hand,  selective  replication  may 
be  relatively  economical.  On  the  other  hand,  the  trend  toward 
increasing  the  number  of  functions  per  device  may  make  the  use  of 
duplication  of  gates  or  busses  less  profitable  if  the  multiple  versions 
of  a  function  are  all  on  a  single  device.  This  is  because  there  tends 
to  be  a  high  correlation  among  faults  within  single  devices.  Further 
limitations  on  some  of  the  techniques  described  here  will  be  felt 
because  of  the  limitations  on  the  number  of  pins  available  per  internal 
function  in  the  new  technology. 


CHAPTER  6.  EXAMPLES  OF  FAULT-TOLERANT  COMPUTERS 


In  the  subsections  of  this  chapter,  we  discuss  fault-tolerance 
requirements  for  computers  used  in  different  applications.  Our 
viewpoint  is  that  the  different  applications  have  different  requirements 
for  reliability,  availability,  data  protection,  maintainability,  etc., 
and  different  opportunities  for  the  use  of  fault-tolerance  techniques. 
These  different  requirements  and  opportunities  result  in  a  variety  of 
computer  architectures.  In  effect  we  see  that  a  single  'best' 
fault-tolerant  computer  design  is  not  possible.  However,  time-sharing 
systems  possess  nearly  all  the  requirements  of  fault  tolerance  of 
computers  in  general.  Therefore,  we  discuss  them  first  and  treat  other 
computer  types  as  variants. 

Each  subsection  deals  with  a  different  application  class  — 
general-purpose  time-shared,  general-purpose  batch,  communication, 
super-fast  and  aerospace.  For  each  application  class  we  discuss  the 
most  common  requirements  and  the  most  appropriate  architectures  to 
satisfy  these  fault-tolerance  requirements.  Table  6.1  is  a  summary,  in 
very  compact  form,  of  the  material  of  this  section.  The  parameters 
quoted  (e.g.,  speed,  memory  size)  are  intended  to  be  the  most  common 
without  implying  that  examples  outside  the  range  cannot  occur.  The 
techniques  that  are  appropriate  for  each  application  class  have  been 
discussed  in  detail  in  the  foregoing  chapters.  Here  we  attempt  for 
specific  applications  to  evaluate  some  of  the  architectural  types 
discussed  in  Section  3.3.  In  addition  Appendix  3  contains  detailed 
considerations  in  the  design  of  fault-tolerant  memory  systems. 

Certain properties are common to all classes of computers (or
applications) of which the following are the most important from the
standpoint of fault tolerance:

* Central memory frequently dominates the cost of the system, but is
also the unit that is most easily and economically protected. Selective
and dynamic use of coding can be very cost-effective.

* The arbitrary logic of central processors is the most difficult to
protect but often represents a small proportion of the total cost.
Thus, replication is practical for many applications. Selective
replication seems practical except when all usage has uniform
criticality.

*  Faults  in  most  peripheral  equipment  (e.g.,  printers,  magnetic  tape 
units,  modems)  are  best  handled  by  providing  spares  and  reconfiguring. 

* In most multiprogrammed systems, a vital component is the drum or
other large storage device that is used for swapping. We therefore
consider the effect of faults in that unit for time-sharing
applications.


In view of the above common features, it is practical to consider a
representative computer system and then treat other types as variants
upon it (from a fault-tolerance standpoint). As such a computer, we
take one of about the scale of the Multics system (Saltzer A2). It is
recognized that Multics is larger than the average installation.
However, it represents a suitable system on which to apply
fault-tolerance techniques, because the loss of availability or of files
is significant. In addition, the cost of Multics precludes the use of
crude  replication  techniques.  Because  we  are  considering  future 
computers,  we  assume  that  LSI  techniques  will  be  used  wherever  possible. 
Such  use  of  LSI  includes  electronics  associated  with  peripheral 
equipment  having  no  stringent  speed  requirements,  as  well  as  memory, 
where  we  can  take  advantage  of  the  regularity  of  structure. 

In the central processors it will generally be necessary to use the
faster MSI logic technology. For a system on the scale of Multics, the
analysis of the use of different fault-tolerance techniques is given in
detail in Section 6.1. Treating this illustrative computer as in some
sense typical of computers in general, Table 6.2 illustrates the
effectiveness of different techniques. The techniques are discussed in
earlier sections of this report. In examining the probabilities of
error, nonavailability, etc., we do not quote values less than 10^-8/hr.
This represents such a low probability that the event would be expected
to occur once every 10,000 years.

TABLE 6.2

EVALUATION OF FAULT-TOLERANCE TECHNIQUES

Notes:

1. Recovery time is dependent on time required to reload from a
previous known correct state. Typically a few seconds should suffice.

2. This is the recovery time for memory failures. For detected
processor failures external maintenance is required.

3. We assume that a duplexed multicomputer is unavailable when fewer
than 2 complete units remain operative.

4. The application program recovery time is dependent on the time it
takes the user to detect an error plus the restarting time.

5. The recovery time is dependent on the fault location. The values
given here are averaged over all fault possibilities.
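
As a quick check of the 10,000-year figure quoted for the 10^-8/hr floor, the conversion can be sketched as follows. This is an illustrative calculation only; it assumes a constant hourly rate and a year of 8760 hours, and the names are ours.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def years_between_events(rate_per_hour: float) -> float:
    """Mean time between events, in years, for a constant hourly rate."""
    return 1.0 / rate_per_hour / HOURS_PER_YEAR


floor_rate = 1e-8  # smallest probability quoted, per hour
years = years_between_events(floor_rate)  # a little over 11,000 years
```

The result is of the order of 10,000 years, consistent with the rounded figure in the text.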

6.1 GENERAL-PURPOSE TIME-SHARED COMPUTERS

The  most  general  case  is  that  of  general-purpose  systems  with 
interactive  and  noninteractive  use.  Many  other  systems  can  be 
considered  as  special  cases  of  such  systems. 

REQUIREMENTS 

With  all  expensive  equipment,  there  is  an  economic  need  for  reliability. 
An  additional  requirement  is  for  the  integrity  of  data.  A  constraint 
derives  from  the  fact  that  a  time-shared  computer  may  frequently  be  used 
by  many  users.  A  loss  of  control  or  data  may  result  in  the  effective 
loss  of  several  hours  work  of  these  users  —  a  severe  penalty.  Valid 
control  and  high  integrity  of  data  are  therefore  vital.  The  manager  of 
such  a  system  should  be  prepared  to  pay  more  to  protect  against  faults 
than  would  the  manager  of  a  strictly  noninteractive  (batch)  system. 

Many  time-sharing  computers  are  used  for  long-term  information 
processing  rather  than  short-term  computing.  The  long-term  protection 
of  data  is  therefore  of  vital  importance.  This  is  typically  achieved  by 
recording  back-up  data  and  program  files  on  disc  or  tape  at  regular 
intervals.  Another  aspect  of  the  need  to  protect  data  files  is  the 
protection  that  must  be  maintained  against  loss  of  data  because  of  the 
actions of other users, or an errant operating system, either possibly
being caused by a hardware fault condition. We see solutions to these
problems through restricting the physical address space accessible to
each component of the system. In terms of the concept of levels in
Section 3.2, we need to assure by suitable hardware means that low-level
software (e.g., that controlling the physical allocation of resources)
is very reliable, while higher levels are constrained to
operate only in the domain allocated to them by the low-level software.
The protection of the operating system is therefore the most crucial
fault-tolerance requirement of the system. The Multics system is a
current  example  of  a  system  that  recognizes  the  need  for  protection  of 
the  innermost  levels  of  the  executive  against  erroneous  operation  at 
outer  levels.  However,  the  protection  in  Multics  is  against  software 
errors  at  outer  levels  and  against  malevolent  users,  not  against 
hardware  faults.  The  solution  for  hardware  faults  is  to  provide  a 
system  in  which  the  redundancy  is  variable  with  time  so  that  the 
low-level  parts  of  the  operating  system  can  be  protected  without 
incurring  redundancy  for  users  who  do  not  require  the  protection. 

Consider,  as  a  central  example,  a  system  of  structure  similar  to 
Multics,  initially  with  the  following  specifications: 

1  Central  processor 

384K  Words  of  memory,  each  32  bits 

1  Unit  for  file  storage 

1  Drum  or  disc  for  swapping 

We  further  assume  that  LSI  circuitry  is  used  throughout  for  all  units 
except  the  central  processor,  where  the  faster  MSI  technology  is  used. 

We can estimate the chip count for an irredundant realization as
follows:

Processor ........ 2000 chips
Memory ........... 3072 chips
Disc control .....   20 chips
Drum control .....   20 chips
Total ............ 5112 chips

Note  that  the  above  estimates  are  intended  only  to  give  the  order  of 
magnitude  of  the  system  components,  no  greater  accuracy  being  required 
(or  intended).  For  simplicity,  we  assume  in  the  following  that  the 
memory  chips  are  organized  as  1024  bytes  each  of  4  bits.  This 
assumption  is  not  critical,  because  other  configurations  of  chips  would 
yield very similar results in the reliability analysis. To a first
approximation, we can assume that system error rate will be in direct
proportion to the number of chips employed, and we assume a failure
probability of 10^-6 per chip per hour. We present the various design
concepts, and the techniques to be applied, in a number of stages. The
result of applying various fault-tolerance techniques is shown in Table
6.2, and illustrated graphically in Figure 6.1. The method of
presentation is to examine a succession of stages of adding redundancy,
in some cases to improve the probability of correctness, in others to
improve the probability of availability, and in others to decrease the
recovery time after a failure. These stages range from techniques
applied to a simplex system (Section 3.3.1) to the multiprocessor
concepts discussed in Sections 3.3.2 and 3.3.3.
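
The chip-count estimate and the first-order error-rate assumption can be reproduced with a short calculation. The sketch below is illustrative only. The per-chip failure rate of 10^-6 per hour is an assumption (it is the value implied by the IMP estimate in Section 6.3, where 1600 chips give 0.0016 failures per hour), and the variable names are ours.

```python
WORDS = 384 * 1024      # 384K words of memory
BITS_PER_WORD = 32
BYTES_PER_CHIP = 1024   # memory chips organized as 1024 bytes...
BITS_PER_BYTE = 4       # ...each of 4 bits

memory_chips = WORDS * BITS_PER_WORD // (BYTES_PER_CHIP * BITS_PER_BYTE)

chips = {
    "processor": 2000,
    "memory": memory_chips,
    "disc control": 20,
    "drum control": 20,
}
total_chips = sum(chips.values())

# Assumed per-chip failure rate, per hour (see the lead-in above).
CHIP_FAILURE_RATE = 1e-6

# First-order model: system error rate proportional to the chip count.
system_failure_rate = total_chips * CHIP_FAILURE_RATE
mtbf_hours = 1.0 / system_failure_rate
memory_share = chips["memory"] / total_chips
```

Under these assumptions the unprotected system has a mean time between chip failures of a little under 200 hours, and memory accounts for about 60 percent of the chips, and hence of the chip faults.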

STAGE  1:  NO  REDUNDANCY 

In  a  totally  unprotected  non-reconfigurable  mode,  we  can  expect  the 
reliability  characteristics  to  be  as  shown  in  the  top  row  of  Table  6.2. 

STAGE  2:  ERROR  DETECTION  IN  MEMORY 

The  most  obvious  first  step  in  applying  redundancy  for  fault  tolerance 
is  in  the  memory.  The  redundancy  is  in  the  form  of  extra  bits  in  the 
words  for  coding  as  discussed  in  Section  4.1.  At  the  lowest  level,  a 
single  byte  per  word  (parity  byte)  reduces  the  probability  that 
incorrect  data  is  able  to  corrupt  results  before  being  detected.  We 
assume  no  mechanisms  exist  for  reconfiguring  around  the  fault  or  for 
recovering  the  lost  computation. 
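
To make the parity-byte idea concrete, the following Python sketch treats a 32-bit word as eight 4-bit bytes (one per memory chip) and stores one extra check byte. The bitwise-XOR check byte is an assumed, simple instance of the codes discussed in Section 4.1, not necessarily the code intended there; the function names and the sample word are ours.

```python
from functools import reduce

BYTES_PER_WORD = 8  # a 32-bit word as eight 4-bit bytes, one per chip


def split_word(word: int) -> list[int]:
    """Split a 32-bit word into eight 4-bit bytes, least significant first."""
    return [(word >> (4 * k)) & 0xF for k in range(BYTES_PER_WORD)]


def parity_byte(info_bytes: list[int]) -> int:
    """Check byte: bitwise XOR of all information bytes."""
    return reduce(lambda a, b: a ^ b, info_bytes, 0)


def check(info_bytes: list[int], stored_parity: int) -> bool:
    """Return True if the stored word is consistent with its check byte."""
    return parity_byte(info_bytes) == stored_parity


word = 0xDEADBEEF
stored = split_word(word)
stored_parity = parity_byte(stored)

# Simulate a fault confined to one chip: a nonzero change to one 4-bit byte.
faulty = list(stored)
faulty[3] ^= 0b1010
```

Any nonzero change confined to a single byte alters the XOR and is therefore detected; as stated above, detection alone neither locates a spare nor recovers the lost computation.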

STAGE  3:  ERROR  DETECTION  AND  BLOCK  RECONFIGURATION  IN  MEMORY 

Memory: 9 chips per block: 8 information, 1 check

384 blocks, reconfiguration around faulty blocks
3456 chips total

Processor: 2000 chips, unreplicated.


Errors  in  memory  are  detected  by  the  error-detecting  code.  At  the  time 
of  detection,  the  faulty  block  is  immediately  identified.  For 
convenience,  we  assume  that  the  number  of  words  per  block  corresponds  to 
the  number  of  bytes  per  LSI  memory  chip,  namely  1024.  The  state  of  the 
computation  affected  by  the  error  is  essentially  lost  unless  other 
measures  were  taken  earlier  to  establish  a  recovery  point. 

STAGE  4:  ERROR  CORRECTION  IN  MEMORY 

Memory:  10  chips  per  block:  8  information,  2  check 

Single  byte  error  correction  within  each  block 
No  reconfiguration  around  faulty  blocks 
3840  chips  total 

Processor:  2000  chips,  unreplicated. 


With increased redundancy certain error correcting codes (e.g., Hamming,
distance four, byte, burst) can be used which have sufficient data to
enable correction of some faults and the detection of some more
extensive faults. These codes enable the computer to survive in the
presence of some memory faults, thereby increasing the MTBF, and also
reduce the probability of incorrect results. The system instantly
recovers from all single faults in memory.

STAGE  5:  CODING  AND  RECONFIGURATION  IN  MEMORY 

Memory: 10 chips per block: 8 information, 2 check

Single byte error correction within each block

Immediate switchover to operative block in response to failure
3840 chips total

Processor: 2000 chips, unreplicated.


Given block replacement in memory (see Sections 4.2.1, 4.2.4 and 4.2.5),
a redundancy of 20 percent in the memory reduces the probability of loss
of data in memory to less than 10^-8/hr, which is negligible with respect
to other fault probabilities. The principal advantage of Stage 5 over
Stage 2 is that for about 60 percent of all faults (i.e., those in
memory), the recovery time is essentially zero because the combination
of coding and reconfiguration allows the system to continue operation
with only a small loss of memory capacity. A good strategy to follow is
to transfer the contents of a faulty block to another block or to disc
before another block error occurs. (In Multics this transfer is
relatively easy, except when the first block of memory is affected.)
Clearly, such conditions resulting from chip failures will be
insignificant compared to faults due to other causes (e.g., connectors,
printed circuit boards, power supplies). The number of such components
will tend to be roughly proportional to the number of chips used, and
the decrease, because of the use of LSI, will allow the use of more
rigorous construction and testing techniques, both of which will reduce
the fault probability.
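
The memory chip counts quoted for the successive stages follow directly from the block organization. The sketch below recomputes them; it reads the 20 percent redundancy as check chips divided by all chips in a block, which is our interpretation, and the function names are ours.

```python
BLOCKS = 384     # 384K words at 1024 words per block
INFO_CHIPS = 8   # information chips per block (8 x 4 bits = 32 bits)


def memory_chips(check_chips_per_block: int) -> int:
    """Total memory chips for a given number of check chips per block."""
    return BLOCKS * (INFO_CHIPS + check_chips_per_block)


def redundancy_fraction(check_chips_per_block: int) -> float:
    """Check chips as a fraction of all chips in a block."""
    return check_chips_per_block / (INFO_CHIPS + check_chips_per_block)


irredundant = memory_chips(0)   # Stage 1
detecting = memory_chips(1)     # Stage 3: one check chip per block
correcting = memory_chips(2)    # Stages 4 and 5: two check chips per block
```

With two check chips per block the memory grows from 3072 to 3840 chips, and the check chips make up 2 of every 10 chips.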

STAGE 6: CODING AND RECONFIGURATION IN MEMORY,

CODING IN THE PROCESSOR

Memory: Same as 5

Processor: Unreplicated portion: 800 chips

Coded portion: 1200 information chips, 240 check chips
(Assume all single chip errors are detected)

2240 chips total

As an alternative development, we may apply coding in the processor
itself. Clearly there are some parts of the processor in which coding
is more easily applied than in others. We estimate that 60 percent of
the processor can be checked for single faults by applying coding to the
following types of units:

Registers

Busses

Adders/subtracters

Counters

We further estimate that coding in the processor adds 20 percent to the
chip count of the above unit types. The remaining units are mainly
concerned with control rather than arithmetic. We take as a
conservative estimate that any fault in the noncoded 40 percent will
cause some incorrect results. Because the coding at this level is used
only for detection, it does not improve the availability but does reduce
the probability of incorrect results. It also shortens the recovery
time in the protected portion of the system, by providing diagnostic
information.

STAGE  7:  CODING  PLUS  RECONFIGURATION  IN  MEMORY, 

CODING  OR  DUPLICATION  IN  THE  PROCESSOR 

Memory  :  Same  as  5 

Processor:  Duplicated  portion  800  +  800  chips 

Coded  portion:  same  as  6 
3040  chips  total. 


For further protection against the possibility of incorrect results, we
take Stage 6, with the addition of duplication (and comparison) of those
parts of the processor that could not be protected by coding. This
addition drastically decreases the error probability, but with a slight
reduction in availability.
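
Duplication and comparison can be sketched in a few lines. The following Python fragment is a hypothetical illustration: the stand-in "control unit", the stuck-bit fault, and all names are assumptions, not part of any design described here.

```python
from typing import Callable


class MismatchError(Exception):
    """Raised when the two copies of a unit disagree."""


def run_duplicated(unit_a: Callable[[int], int],
                   unit_b: Callable[[int], int],
                   operand: int) -> int:
    """Run both copies and compare; a disagreement signals a fault.

    Comparison only detects the fault. It cannot tell which copy is
    wrong, so the caller must retry or call for maintenance.
    """
    result_a = unit_a(operand)
    result_b = unit_b(operand)
    if result_a != result_b:
        raise MismatchError(f"copies disagree: {result_a} != {result_b}")
    return result_a


def control_unit(operand: int) -> int:
    """Stand-in for a piece of uncoded control logic."""
    return (operand * 3 + 1) & 0xFF


def faulty_control_unit(operand: int) -> int:
    """The same unit with its high-order output bit stuck at 1."""
    return control_unit(operand) | 0x80
```

For example, `run_duplicated(control_unit, control_unit, 5)` returns the common result, whereas pairing `control_unit` with `faulty_control_unit` raises `MismatchError`. One plausible reading of the slight reduction in availability is that the duplicated portion contains twice as many chips that can fail, and a mismatch in either copy stops the computation.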

STAGE  8:  CODING  AND  RECONFIGURATION  IN  MEMORY, 

ERROR  CORRECTING  CODES  PLUS  TRIPLICATION  IN  PROCESSORS 

Memory :  Same  as  5 

Processor:  Triplicated  portion:  3  X  800  chips  =  2400  chips 

Coded  portion:  1200  information  chips,  600  spare  chips 
(Assume  all  single  chip  errors  are  masked) 


4200 chips total.

In this stage, the regular portion of the processor is protected by
single-byte error-correcting codes. The extra cost here, including a
high-speed decoder, amounts to a redundancy of 33 percent. The remaining
nonregular portion of the processor is made fault tolerant by
triplication. The availability, correctness, and recovery time should
be adequate for practically all time-sharing installations. Thus, this
stage represents the redundancy required to achieve a high degree of
fault tolerance in a simplex system.
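
The processor chip counts of Stages 6 to 8, and the masking effect of triplication, can be checked with the sketch below. It is illustrative only: the bitwise two-out-of-three voter is the textbook form of majority voting and is assumed here rather than taken from the report, the 33 percent figure is read as spare chips over all chips in the coded portion, and the names are ours.

```python
UNCODED_CHIPS = 800      # control-oriented logic, protected by replication
CODED_INFO_CHIPS = 1200  # registers, busses, adders/subtracters, counters


def processor_chips(copies: int, check_chips: int) -> int:
    """Total processor chips for a given replication and coding overhead."""
    return copies * UNCODED_CHIPS + CODED_INFO_CHIPS + check_chips


def majority(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three vote over the outputs of three copies."""
    return (a & b) | (a & c) | (b & c)


stage6 = processor_chips(copies=1, check_chips=240)
stage7 = processor_chips(copies=2, check_chips=240)
stage8 = processor_chips(copies=3, check_chips=600)

# Spare chips as a fraction of the whole coded portion in Stage 8.
stage8_coding_redundancy = 600 / (CODED_INFO_CHIPS + 600)

# One faulty copy is outvoted by the two good ones.
good, bad = 0b10110010, 0b10110110
voted = majority(good, bad, good)
```

Unlike the comparison used with duplication, the vote masks a single faulty copy without interrupting the computation.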

STAGE  9:  FIXED  MULTICOMPUTER,  SELECTIVE  DUPLICATED  REDUNDANCY 

n individual computer units, n = 3 - 16, interconnected by a
communication bus

Each  unit  has  1.2/n  the  power  of  the  simplex  units  in  stages  1-8 

No  fault  tolerance  within  units. 


In  this  stage,  the  processing  load  is  divided  among  a  number  of 
processing  units.  We  assume  that  the  total  processing  complexity  is 
increased  by  20  percent,  due  to  the  extra  cost  of  the  communication  bus 
and  due  to  the  extra  processing  power  needed  to  counteract  the 
inefficiency  of  running  large  jobs  in  smaller  processors.  Note  that  the 
memory  is  also  divided  in  a  fixed  manner  so  that  large  and  small  jobs 
all  get  essentially  equal  portions  of  main  memory.  The  critical  portion 
of the operating system will run simultaneously in a pair of computer
units. The portion that is critical is relatively small, comprising
about 10 percent of the system overhead, as it consists of the job
dispatching and error control procedures. The recovery time after a
failure in an operating system unit is dependent on the time to restart
the operating system from a checkpoint. The recovery for user programs
depends on the facilities available, and how they are used. Results are
tabulated for decomposition into 3, 9, and 16 computers.

STAGE 10: FIXED MULTICOMPUTER, SELECTIVELY TRIPLICATED

This stage is the same as Stage 9, except that the critical portion of
the operating system is run in a triplicated mode. This can reduce the
recovery time after a fault occurring in the execution of the operating
system.

STAGE  11:  FIXED  MULTICOMPUTER,  UNIFORM  REDUNDANCY 

Here  all  programs  are  run  simultaneously  in  two  computer  units.  Thus 
the  system  is  more  than  50  percent  redundant,  with  good  availability. 
The  recovery  time  is  uniformly  low  for  all  programs.  This  stage  is  not 
of  primary  interest  here,  but  is  of  use  in  aerospace  environments. 

STAGE  12:  RECONFIGURABLE  MULTICOMPUTER  OR  MULTIPROCESSOR, 

DYNAMICALLY  USED  DUPLICATION  FOR  THE  OPERATING  SYSTEM 

Conventional  multiprocessor  containing  n  processors  and  n  memories, 
n  =  3  -  16 

Each  processor  is  1.25/n  the  power  of  the  simplex  processor 
No  fault  tolerance  within  units 

Pair  of  processor/memory  combinations  can  be  operated  in  duplicated 
mode  for  error  detection. 


Here a processor/memory combination can be configured out of operative
processors and memory blocks. It is also possible for application
programs to get variable blocks of memory by appropriately configuring
the switch. The switch here is more complex than the communication bus
of Stages 9, 10 and 11, so that we assume the extra processing
complexity required is about 25 percent as compared with the simplex
processor. The critical portion of the operating system again runs in
two computer units. The critical portion is larger here than in Stage
9, because the protection mechanism is more sophisticated, and must be
fault tolerant. Consequently, we assume that about 40K words of main
memory, and about 30 percent of the processing load, are required for
the critical portion of the operating system. The primary advantage of
this stage over Stage 9 is in its increased availability, because of the
partitioning into separate processor and memory units. The results for
this stage represent the performance expected for the architecture of
Section 3.3.3.

STAGE 13: MULTIPROCESSOR WITH DYNAMICALLY USED DUPLICATION,

FLEXIBLE INTERPROCESSOR COMMUNICATION, SELECTIVE MEMORY CODING

Conventional multiprocessor containing n processors and m memories,
n, m = 3 - 16, n not necessarily equal to m
Each processor is 1.3/n the power of the simplex processor
Dynamically modifiable byte error correction within memory units
No fault tolerance within processor units

Pair of processors can be operated in a duplicated mode for error
detection.

This  stage  differs  from  Stage  12  in  that  the  switch  can  interconnect 
among  processors,  as  well  as  between  processors  and  memories.  This 
added  flexibility  permits  two  processors  to  operate  in  a  duplicated 
mode,  without  requiring  the  cost  of  memory  duplication.  The  memories 
can  use  coding  selectively,  as  in  the  case  of  the  processors,  only  for 
the  critical  portion  of  the  operating  system.  Because  of  the  switch 
complexity,  we  assume  that  the  extra  processing  power  required  for  this 
stage  is  30  percent  of  the  simplex  processor.  The  net  effect  is  to 
increase  the  availability  as  compared  with  Stage  12. 
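
The unit sizes assumed for the multicomputer and multiprocessor stages (1.2/n, 1.25/n and 1.3/n of the simplex processor) can be tabulated as follows. This is a sketch for illustration: the choice of n = 3, 9 and 16 follows the decomposition quoted for Stage 9 and is applied to Stages 12 and 13 only by assumption, and the remaining-power calculation assumes equal units that keep running at full power after another unit fails.

```python
# Total processing power relative to the simplex system, from the stage
# descriptions: 20, 25 and 30 percent extra for Stages 9, 12 and 13.
TOTAL_POWER = {9: 1.20, 12: 1.25, 13: 1.30}


def unit_power(stage: int, n: int) -> float:
    """Power of one unit as a fraction of the simplex processor."""
    return TOTAL_POWER[stage] / n


def remaining_power(stage: int, n: int, failed_units: int = 1) -> float:
    """Aggregate power left after some units have failed."""
    return (n - failed_units) * unit_power(stage, n)


per_unit = {
    stage: {n: unit_power(stage, n) for n in (3, 9, 16)}
    for stage in TOTAL_POWER
}
```

Under these assumptions, a 3-unit Stage 9 system drops to 0.8 of the simplex power after one unit fails, while a 16-unit system still exceeds the simplex power.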


6.2  GENERAL-PURPOSE  BATCH  PROCESSORS 


In  general-purpose  batch  applications,  we  include  both  scientifically 
oriented  applications  and  those  concerned  with  more  commercially 
oriented  tasks.  The  fault-tolerance  requirements  of  both  differ 
slightly  from  time-shared  systems.  Principal  among  these  differences 
are  those  stemming  from  the  need  to  meet  deadlines,  and  the  extreme 
importance  in  certain  cases  of  the  need  to  protect  against  the 
possibility  of  erroneous  output.  Techniques  that  are  employed  with 
success  at  present  include  the  use  of  accounting  checks  in  commercial 
operations  to  detect  errors,  and  the  use  of  checkpoint-restart  to 
prevent excessive lost processing in the event of machine breakdown.

The most difficult criterion to meet is that on the cost of
fault-tolerance hardware. Rarely will a figure in excess of 20 percent
be justified for the cost of such hardware. This figure explains to a
certain extent why such hardware has been restricted in existing
systems, frequently being limited to parity in memory and on data
transfers. However, recent computers have extended the protection to
single-error correcting codes in memory (e.g., IBM 370, Burroughs 7700),
and even to the use of residue codes (see Section 5.1) in the arithmetic
unit of the Burroughs 7700. Reconfiguration in the event of faults has
also  been  introduced,  in  such  units  as  memory  blocks,  I/O  channels, 
power  supplies  and  peripheral  equipment. 

The  trend  of  decreasing  cost  of  electronics  (compared  with  other  costs 
such  as  manpower)  will  continue,  and  also  the  use  of  such  computers  for 
more  and  more  time-critical  calculations.  We  can  therefore  expect  to 
see  a  move  toward  computers  with  a  greater  demand  for  fault-tolerance 
than  at  present. 

The  architectures  most  suited  to  general  purpose  batch  operations  will 
probably  employ  fault-tolerance  measures  that  are  relatively  close  to 
those  used  in  time-shared  computers.  The  one  area  in  which  significant 
differences  will  be  found  is  in  the  peripheral  equipment  so  essential  to 
batch-operated  computers,  particularly  those  used  for  commercial  EDP 
operations.  In  installations  that  require  high  reliability,  present 
practice  is  to  use  a  large  number  of  each  type  of  peripheral  so  that  the 
loss of one unit causes only a small decrease in the throughput
capabilities. This practice is already successful in providing the
necessary high availability.

6.3.  COMMUNICATIONS  PROCESSORS 

In  communications  processors,  we  are  concerned  with  processors  for  three 
main  functions: 

* Message switching, e.g., the Bell system ESS

* Message store and forward, e.g., the interface message processor (IMP)

* Front-end processing, e.g., the terminal interface processor (TIP).

These  functions  are  so  closely  related  that  a  combination  of  them  often 
coexists  in  one  computer. 

REQUIREMENTS 

A  communications  processor  is  always  part  of  a  much  larger  system.  The 
important  requirement  is  reliability  of  the  system  as  a  whole,  and  we 
therefore  expect  that  efforts  would  be  made  to  design  the  system  so  that 
faulty processors do not interrupt service within the system. A fault
in a processor that was acting as a front-end processor or as a
connecting  point  for  one  of  the  host  computers  of  the  system  would 
isolate  either  users  or  some  facilities  from  the  system  but  should  not 
cause  serious  degradation  of  the  remainder  of  the  system.  Such  a 
front-end  processor  should  be  at  least  as  reliable  as  the  host  computer 
attached  to  it.  Certainly,  one  order  of  magnitude  is  sufficient  for  the 
improved  reliability  over  the  host,  and  more  stringent  requirements  are 
unrealistic. 

Another  potential  reliability  requirement  is  for  the  protection  of  data. 
In  most  communications  systems,  data  protection  is  not  of  significance 
in  the  individual  communications  processors  but  should  be  achieved  at 
the  system  level,  e.g.,  using  such  techniques  as  coding  applied  to  the 
messages  to  detect  errors,  and  retransmission  by  alternative  routes  to 
achieve  error  recovery.  This  system  emphasis  has  implications  on  how 
recovery  from  faults  can  be  achieved,  in  that  it  is  not  necessary  to 
remember  the  state  of  the  processor  at  the  time  of  the  fault  in  order  to 
restart  it  after  appropriate  corrective  action  has  taken  place. 
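
The system-level approach described above (coding to detect a damaged message, then retransmission by an alternative route) can be sketched as follows. This is a hypothetical illustration, not the ARPANET protocol: CRC-32 from the Python standard library stands in for whatever message code a real network would use, and the route functions and names are assumptions.

```python
import zlib
from typing import Callable, Iterable, Optional

Route = Callable[[bytes], bytes]  # a route delivers (possibly damaged) bytes


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(received: bytes) -> Optional[bytes]:
    """Return the payload if the check passes, otherwise None."""
    if len(received) < 4:
        return None
    payload, check = received[:-4], received[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != check:
        return None
    return payload


def send_with_retry(payload: bytes, routes: Iterable[Route]) -> Optional[bytes]:
    """Try each route in turn until a copy arrives with a valid check."""
    message = frame(payload)
    for route in routes:
        delivered = unframe(route(message))
        if delivered is not None:
            return delivered
    return None


def good_route(message: bytes) -> bytes:
    """A route through correctly working processors."""
    return message


def faulty_route(message: bytes) -> bytes:
    """Flip one bit in transit, as a faulty processor on the path might."""
    return bytes([message[0] ^ 0x01]) + message[1:]
```

Note that the sender needs no knowledge of the state of the faulty processor: `send_with_retry(b"hello", [faulty_route, good_route])` simply discards the damaged copy and succeeds on the second route.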

Because  retransmission  can  be  used  to  accomplish  recovery  from  a  faulty 
message,  it  is  far  more  important  to  design  the  processors  and  other 
components  of  the  system  to  achieve  error  detection  than  to  provide 
error-correction capabilities.

Communications processors are often located at sites with no resident
maintenance staff. Therefore, the diagnosis and repair of faults should
(if possible) be carried out from distant points in the network.
Diagnosis of some fault conditions is possible, but many other
conditions render the faulty processor incapable of communicating
anything meaningful to the other parts of the communication network.
Therefore these other conditions present a significant problem in
diagnosis.

A  RELIABLE  IMP 

To  illustrate  the  design  concepts  appropriate  for  a  communications 
processor,  consider  the  IMP  currently  used  on  the  ARPANET.  This 
processor  carries  out  all  of  the  functions  mentioned  above  (switching, 
store  and  forward,  terminal  and  host-computer  interfacing).  The  IMP 
uses  a  Honeywell  516  with  additional  electronics  principally  to 
interface  to  the  communication  equipment  and  host  computers. 

As an approximation the H516 contains about 1600 ICs, each of which
contains (on average) about 10 gates. Assuming that chip failures are
a significant proportion of total hardware failures, and assuming a
failure rate of 10^-6 per chip per hour, we can expect a failure rate of
0.0016 per H516 per hour, or about 13 per H516 per year.

As of August 1972 (see BBN's "Network Summary", Aug 1972), 31.6 IMP
years  had  been  logged.  With  the  above  assumption,  we  would  expect  about 
400  chip  failures.  The  number  of  unscheduled  down  times  over  this 
period  was  881.  In  resolving  these  figures  (400  and  881),  we  point  out 
that: 

*  The  881  includes  software  and  external  power-supply  failures. 

*  The  number  400  excludes  many  other  failures,  e.g.,  of  the  core 
memory,  passive  components,  and  connectors. 

* Marginal conditions corrected during preventive maintenance are not
included in the 881 unscheduled down times.

As a conclusion, we regard 10^-6 failures per chip per hour as a
realistic but perhaps conservative estimate of chip failure probability.
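
The arithmetic behind the comparison of expected chip failures with observed down times can be sketched as follows. The per-chip rate of 10^-6 per hour is the value implied by 0.0016 failures per hour for 1600 chips; the 8760-hour year and the variable names are our assumptions, so the results come out slightly above the rounded figures (about 13 per year and about 400) quoted in the text.

```python
HOURS_PER_YEAR = 24 * 365     # 8760, ignoring leap years

chips_per_h516 = 1600
chip_failure_rate = 1e-6      # per chip per hour, as assumed in the text

failures_per_hour = chips_per_h516 * chip_failure_rate
failures_per_year = failures_per_hour * HOURS_PER_YEAR

imp_years_logged = 31.6       # as of August 1972
expected_chip_failures = failures_per_year * imp_years_logged
observed_down_times = 881

# Fraction of the observed down times that chip failures alone would explain.
chip_share = expected_chip_failures / observed_down_times
```

On this estimate chip failures would account for roughly half of the unscheduled down times, leaving the rest to the other causes listed above.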

We now consider the design of a more reliable processor for the IMP
environment. We consider two possible realizations: MSI (medium scale
integration, about 100 gates/chip) with a core memory, and LSI (large
scale integration) with a semiconductor memory. We reject the possibility
of using small scale integration (SSI) in any future development.

We can expect that with even an MSI realization, the number of chips
required will be reduced by a ratio of about 10:1 to approximately 160
chips, with an attendant improvement in reliability. In addition, the
number of connectors will also be reduced. We can expect that failures
because of active components will be reduced to about 1.5 per year per
IMP. In an LSI realization, the memory would require about 128 chips
(assuming 4K bits per chip and 32K words of 16 bits), and the processor
about 16 chips, resulting in approximately the same (1.5) number of
faults  per  year  in  the  active  circuits. 
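
The arithmetic behind these estimates can be sketched as follows. This is an illustration only: it assumes a constant failure rate of 10^-6 per chip per hour, treats the processor as a series system in which any chip failure is a processor failure, and takes the H516 chip count to be 1600 (an assumption, inferred from the 10:1 reduction to 160 chips). The function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def failures_per_year(chips, rate_per_chip_hour=1e-6):
    """Expected chip failures per year when chip failure rates simply add."""
    return chips * rate_per_chip_hour * HOURS_PER_YEAR


# 1600 is an assumed chip count for the H516; 160 and 128 + 16 are the
# MSI and LSI chip counts quoted in the text.
for label, chips in [
    ("H516 (assumed 1600 chips)", 1600),
    ("MSI realization (160 chips)", 160),
    ("LSI realization (128 + 16 chips)", 128 + 16),
]:
    print(f"{label}: {failures_per_year(chips):.2f} failures per year")
```

Under these assumptions the results come out close to the rounded figures used in the text (about 13 per year for the H516 and about 1.5 per year for either new realization).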

Against  the  above  projected  failure  rates,  we  must  compare  failures  due 
to  non-electronic  causes,  e.g.,  city  power  failures.  These  latter 
failures  will  dominate.  It  is  therefore  our  conclusion  that  the  correct 
design  policy  is: 

*  Use  MSI  or  LSI  circuitry  whenever  possible. 

*  Maintain  message  integrity  on  a  system  basis. 

*  Maintain  system  integrity  by  re-routing  on  a  system  basis. 

*  Improve  overall  reliability,  e.g.,  by  improving  the  reliability  of 
software,  or  that  of  power  service. 

We further point out that a failure rate of 1.5 per year for an IMP-like
processor is expected to be far better than that of most of the host
computers to which such processors are attached.

As corroboration of the above viewpoint, we note that the communication
processors used by Tymshare Inc. in their TYMSAT system experience an
average of 1.5 failures per year. There are 93 such processors in use.
The processors are Varian 620 computers, which are comparable in
capability with the Honeywell 516 used in the IMP.

There is a choice of how much fault tolerance to put in the IMPs. Some
investment  in  IMP  reliability  is  worthwhile  in  light  of  the  expected 
increase  in  the  availability  of  hosts  via  the  IMPs. 

A  RELIABLE  HIGH-PERFORMANCE  COMMUNICATIONS  PROCESSOR 

As  seen  above,  a  reasonably  reliable  IMP  can  be  built  without  resorting 
to any special fault-tolerance techniques. This possibility ceases to
exist  if  a  communication  processor  is  to  be  designed  for  significantly 
higher  performance. 

For  the  purposes  of  this  subsection,  we  consider  a  high-performance 
communications  processor  that  contains  an  order  of  magnitude  more 
components  than  the  IMP  replacement  discussed  above.  Assuming  the  same 
technology, this greater complexity would increase the expected number
of  failures  per  year  from  1.5  to  15,  an  unacceptable  increase  which  must 
be  handled  by  the  use  of  fault-tolerance  techniques. 

In  addition,  we  can  envisage  an  increase  in  traffic  on  the  network.  At 
present  on  the  ARPANET,  the  traffic  load  is  small  enough  that  the 
rerouting  of  messages  can  be  used  as  a  technique  to  prevent  a  faulty  IMP 
from affecting other parts of the system. As the traffic load
increases, this technique becomes less viable, and it becomes necessary
for  the  communication  processors  to  be  more  reliable. 

We  are  concerned  with  three  aspects  —  error  detection,  error 
correction,  and  processor  availability.  It  is  recommended  that  error 
detection  be  carried  out  by  coding  on  the  messages  (or  packets).  The 
overhead associated with the coding bits is very small for packets of
the order of a thousand bits. Because of the tendency of line failures
to  produce  bursts  of  errors,  a  burst  detection  code  should  be  used. 
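
As an illustration of burst detection by coding on the packet, the sketch below appends a 16-bit cyclic redundancy check to a 1000-bit payload, an overhead of 1.6 percent. The choice of code is ours, not the report's: we assume the generator polynomial x^16 + x^12 + x^5 + 1 with an initial register value of 0xFFFF. A cyclic code with a 16-bit check detects every error burst of 16 bits or fewer. The helper names are hypothetical.

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16, generator x^16 + x^12 + x^5 + 1 (assumed choice)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def encode(payload: bytes) -> bytes:
    """Append the two check bytes to the payload."""
    return payload + crc16_ccitt(payload).to_bytes(2, "big")


def is_intact(packet: bytes) -> bool:
    """Recompute the check over the payload and compare with the stored one."""
    payload, check = packet[:-2], packet[-2:]
    return crc16_ccitt(payload).to_bytes(2, "big") == check


def flip_burst(packet: bytes, start_bit: int, length: int) -> bytes:
    """Invert `length` consecutive bits beginning at `start_bit`."""
    damaged = bytearray(packet)
    for bit in range(start_bit, start_bit + length):
        damaged[bit // 8] ^= 0x80 >> (bit % 8)
    return bytes(damaged)


packet = encode(bytes(range(125)))            # 125 bytes = 1000 payload bits
print(is_intact(packet))                      # undamaged packet passes
print(is_intact(flip_burst(packet, 300, 11)))  # an 11-bit burst is caught
```

A receiver that finds the check failing would request retransmission rather than attempt correction, which matches the recommendation that follows.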

Retransmission  represents  the  most  satisfactory  technique  for  error 
correction,  and  where  possible,  this  retransmission  should  be  over  an 
alternative  route.  However,  it  is  necessary  in  a  heavily  loaded  system 
that  the  number  of  such  retransmissions  be  kept  acceptably  low.  This 
constraint  implies  a  requirement  for  rapid  detection,  diagnosis,  and 
correction  of  any  fault  condition  in  any  of  the  processors. 

The  dual  requirements  of  high  performance  and  high  availability  suggest 
a  multiprocessor  or  multicomputer  approach.  Two  such  systems  exist  in 
the  design  state:  the  projected  new  high-performance  IMP  design 
(Ornstein A2, and Heart 73), and the PLESSEY 250 (Killians A2). In both
cases,  high  availability  of  the  processor  is  to  be  achieved  by  switching 
out  faulty  processors  and  using  other  units  to  take  over  the  workload. 

A  problem  that  can  occur  in  such  schemes  is  that  the  unit  carrying  out 
the disconnection must itself be very reliable so that one can guarantee
that  a  faulty  unit  cannot  corrupt  the  whole  system,  i.e.,  we  need  to 
achieve  a  significant  degree  of  fault  isolation. 

In  the  case  of  the  new  IMP  design,  the  disconnection  is  carried  out  by 
sending  a  code  word  to  the  bus  interface  of  the  bad  processor.  The  use 
of  a  code  word  rather  than  a  single  control  signal  is  intended  to 
prevent other processors that are faulty from turning off good
processors. The use of a code word, and therefore the need to recognize
the  correct  code  word,  will  increase  the  complexity  of  the  logic  that 
carries  out  the  disconnection,  thereby  tending  to  make  that  logic  less 
reliable.  On  the  other  hand,  there  is  a  somewhat  better  probability 
that  a  bad  processor  will  not  accidentally  turn  off  good  processors. 
Other  schemes  to  carry  out  this  operation  have  been  investigated  and 
appear  to  have  some  merit.  Principal  among  such  schemes  is  one  used  in 
the PRIME system at Berkeley (Borgersen A2). In that scheme, if a
processor  decides  to  turn  off  another  processor,  it  asks  a  third 
processor  to  carry  out  this  function.  The  third  processor  validates  the 
operation before carrying it out. In general, therefore, two processors
must be at fault before incorrect disconnection is carried out. This
rule is not entirely true for all possible fault conditions. For
example, if the third processor was faulty and erroneously believed that
it had been told to turn off the second processor, then incorrect
disconnection would occur, caused by a fault in only one processor. An
improved scheme for carrying out disconnection could use the combined
logic of several processors to turn off any other processor. For
example, to turn off processor 4 would require processors 1, 2, and 3 to
disconnect it. To turn off 5, the logic of 2, 3 and 4 would be used,
and so on. The disconnect function would include a voter from the three
control signals, thereby preventing any single processor at fault from
being able to turn off any other. Such a scheme would prevent a single
faulty processor causing erroneous disconnection. Yet, because the
voter contains significantly less logic than the code recognizer of the
IMP scheme, the improved scheme would achieve greater system
reliability.  Schemes  such  as  those  discussed  above  are  all  possible  in 
the  PLESSEY  250  system,  where  such  actions  are  carried  out  by  program. 
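
The voted disconnection scheme described above can be sketched in program form, as might be done on a machine where such actions are carried out by program. The sketch rests on assumptions of ours: processors are numbered from 1, the three guards of a processor are the three preceding it (wrapping around at the start of the numbering), and the voter is a two-out-of-three majority. All names are hypothetical.

```python
def guards_of(target: int, n: int) -> list:
    """The three processors whose signals are voted on to disconnect `target`.

    Processors are numbered 1..n (n >= 4); the wrap-around is an assumption.
    """
    return [((target - k - 1) % n) + 1 for k in (3, 2, 1)]


def should_disconnect(target: int, requests: dict, n: int) -> bool:
    """Majority vote over the control signals of the target's three guards.

    `requests` maps a processor number to the set of processors it is
    currently asking to have switched out.
    """
    votes = sum(target in requests.get(g, set()) for g in guards_of(target, n))
    return votes >= 2


n = 5
print(guards_of(4, n), guards_of(5, n))

# One faulty processor demanding that everyone else be switched out
# cannot disconnect anything by itself.
rogue = {2: {1, 3, 4, 5}}
print([t for t in range(1, n + 1) if should_disconnect(t, rogue, n)])

# Two guards agreeing that processor 4 is bad do disconnect it.
print(should_disconnect(4, {1: {4}, 3: {4}}, n))
```

The voter itself is a single majority function of three signals, which is the sense in which it contains less logic than a code-word recognizer.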

Even  if  the  new  IMP  design  achieves  100  percent  availability,  it  will 
still  suffer  from  many  of  the  breakdown  situations  that  occur  with  the 
present  IMP.  Significant  among  these  are  software  bugs,  the  breakdown 
of lines between IMPs, the loss of power to the computer, and occasional
catastrophes such as when the IMP at Lincoln Lab was affected by a
lightning  strike.  The  operation  of  a  very  reliable  network  must  be 
carried  out  with  significant  management  attention  to  such  matters.  In 
the  case  of  modern  LSI  machines,  with  their  potentially  low  power  drain, 
it  is  entirely  practical  to  use  standby  power  supplies.  In  addition, 
the  processors  can  be  placed  in  a  protected  environment  to  avoid 
problems due to temperature extremes or other environmental conditions.

Software troubles can be removed primarily by increased validation of
programs before their use. Such validation at present cannot be carried
out fully because of the lack of a sufficiently large test facility at Bolt
Beranek and Newman. In the more general case of communication processors
for other than the research community (such as the present ARPA
network), we can envisage a much more stable operating environment with
fewer program changes. In that case, the software troubles should be
significantly reduced. Stability of operating environment is certainly
the case with more established networks (for example, Tymshare's
network, where software troubles are negligible).

In examining the ARPA network, we see ways in which the network can
achieve higher availability by the use of a few replicated lines. The
computers  on  the  network  are  mostly  in  fairly  tight  geographical 
clusters,  and  few  IMPs  are  heavily  loaded.  We  can  therefore  envisage 
the  multiple  connection  of  certain  computers  to  IMPs  as  "very  distant 
hosts".  A  particular  grouping  could,  for  example,  be  SRI,  Stanford 
University,  NASA  Ames,  and  Berkeley,  which  could  be  multiply  connected 
to  each  other's  IMPs.  Another  such  grouping  could  include  MIT,  Harvard, 
Lincoln  Lab  and  BBN.  These  connections  could  be  accomplished  in  such  a 
way  that,  if  an  IMP  were  lost,  the  hosts  attached  to  that  IMP  would 
operate  as  very  distant  hosts  of  the  other  IMPs.  This  hookup  would 
prevent  the  hosts  from  losing  their  connection  to  the  network.  This 
technique  is  not  100  percent  useful,  as  some  computers  (e.g.,  University 
of  Utah)  are  not  geographically  close  to  other  IMPs.  However,  the  total 
system  reliability  could  be  significantly  improved  at  low  cost. 

6.4.  SUPER-FAST  COMPUTERS 


Several super-fast computers exist or are in development. Notable
examples are the CDC STAR, the Texas Instruments ASC, the ILLIAC 4, and
the  Goodyear  STARAN.  The  structure  of  these  computers  differs 
substantially  from  conventional  computers.  In  this  subsection,  we 
examine  fault-tolerance  techniques  that  are  appropriate  to  this  class  of 
computers. 

REQUIREMENTS 

Such  computers  frequently  cost  significantly  in  excess  of  10  million 
dollars.  Backup  alternative  computers  seldom  exist,  principally  because 
few  models  of  each  computer  are  produced,  and  in  certain  cases  there  is 
only  one  in  existence.  Some  of  the  applications  for  these  computers 
have a demand for high reliability. An example of this case is the
control of a ballistic missile defense system.

The  great  complexity  of  these  computers  plus  the  very  high  speed  of 
their circuitry tends to make fault diagnosis a very complex process.
The great complexity also increases the component count, each added
component increasing the unreliability of the system, thereby tending to
make the MTBF much worse. In the case of ILLIAC 4, the MTBF is
currently approximately five hours.

TECHNIQUES 

The  large  memories  associated  with  the  super  computers  tend  to  enhance 
the  system  benefit  of  memory  fault-tolerance  techniques.  Typically,  the 
memory will be a very high portion of the total component count within
the system. The techniques of coding and reconfiguration as discussed
in Chapter 4 and Appendix 3 are applicable to such systems. With a
redundancy of the order of 25 percent, they can provide highly reliable
memories, whereas with coding alone, the lower redundancy will
still produce acceptably good reliability. The only drawback to the use
of  such  techniques  in  these  machines  is  the  fact  that  they  add  a  certain 
number  of  gate  delays  in  the  access  time  to  the  memories,  whereas  such 
computers  generally  are  designed  to  operate  the  memory  as  fast  as 
possible.  The  use  of  look-aside  pipeline  decoding  (Carter  et  al.  72b) 
prevents  the  decoding  delay  from  having  a  serious  impact  on  the 
processing  speed. 

In  the  case  of  pipelined  arithmetic  units  such  as  the  ASC  and  the  STAR 
computers,  the  pipelining  of  arithmetic  checking  by  residue  codes  or 
other  means  can  be  carried  out  in  parallel  with  the  main  processing 
pipe.  The  result  of  checking  in  parallel  is  a  very  small  delay  to  the 
arithmetic  operations,  since  the  syndrome  generation  adds  only  2  gate 
delays  to  the  length  of  the  pipe. 
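
The principle of checking arithmetic by residue codes in parallel with the main pipe can be sketched as follows. The sketch is ours and assumes a check modulus of 3; the report does not name a modulus. The check unit works only on the two-bit residues of the operands, and its prediction is compared with the residue of the full result. Since no power of 2 is divisible by 3, any single-bit error in the result changes its residue and is detected. The names are hypothetical.

```python
import operator

MODULUS = 3  # assumed check modulus


def checked(op, a, b, unit):
    """Run `unit` as the main pipe and compare residues from a parallel check.

    `op` is the intended operation (add or multiply); `unit` is the possibly
    faulty hardware that performs it. Returns (result, check_passed).
    """
    result = unit(a, b)                                  # main arithmetic pipe
    predicted = op(a % MODULUS, b % MODULUS) % MODULUS   # small residue pipe
    return result, (result % MODULUS) == predicted


def stuck_bit_adder(a, b, bit=5):
    """A hypothetical faulty adder that inverts one bit of the true sum."""
    return (a + b) ^ (1 << bit)


print(checked(operator.add, 1234, 5678, operator.add))
print(checked(operator.mul, 123, 457, operator.mul))
print(checked(operator.add, 1234, 5678, stuck_bit_adder))
```

Because the residue pipe handles only very short operands, it can run alongside the main pipe and add little to its length.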

Large  computers  are  frequently  used  on  calculations,  the  correctness  of 
which  can  be  verified  by  what  we  could  call  algorithmic  checking.  This 
is the carrying out of a subsidiary calculation that will form a check
as to the correctness of the first calculation. An obvious example is
to reinvert a matrix after the original inversion process to see if the
resulting matrix is the same as the original (within certain bounds, to
allow  for  round  off  error).  Many  examples  of  this  type  of  checking 
exist.  We  cite  a  few  below. 


PARTIAL  DIFFERENTIAL  EQUATIONS 


Many  partial  differential  equations  can  be  checked  by  examining  the 
validity  of  the  governing  equation  at  each  point  in  the  mesh.  For 
certain  equations,  this  represents  a  task  equal  in  size  to  the  original 
solution of the equations. However, for some partial differential
equations,  such  as  boundary  value  problems,  the  checking  for  correct 
solution  is  significantly  easier  than  finding  the  solution. 


MATRIX  OPERATIONS 


The  reinversion  of  the  matrix  as  mentioned  above  provides  a  check. 
However, this essentially doubles the total work performed. We can
instead carry out a related calculation. For example, we can multiply
the inverse by an arbitrary vector x, yielding a vector y. By also
multiplying the original matrix by y, we should obtain x, the
original arbitrary vector. Although this method is not 100 percent
certain, most faults in a computer will be detected in this manner.
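
A minimal sketch of this vector check follows, assuming matrices held as lists of rows, a few randomly chosen test vectors, and a fixed tolerance for round-off; all three choices, and the function names, are ours. Each trial costs two matrix-vector products, on the order of n^2 operations, against the order of n^3 needed to reinvert.

```python
import random


def mat_vec(m, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def inverse_looks_right(a, a_inv, trials=3, tol=1e-9, rng=random):
    """Check a claimed inverse by testing that A (A^-1 x) reproduces x."""
    n = len(a)
    for _ in range(trials):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        y = mat_vec(a_inv, x)      # y = A^-1 x
        x_back = mat_vec(a, y)     # should equal x up to round-off
        if any(abs(p - q) > tol for p, q in zip(x, x_back)):
            return False
    return True


a = [[4.0, 7.0], [2.0, 6.0]]
good = [[0.6, -0.7], [-0.2, 0.4]]
bad = [[0.6, -0.7], [-0.2, 0.5]]   # one element corrupted
print(inverse_looks_right(a, good), inverse_looks_right(a, bad))
```

A fault whose effect happens to vanish on every test vector would escape, which is why the method is not 100 percent certain.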


The  calculation  of  eigenvectors  and  eigenvalues  can  be  checked  by  the 
fundamental  relationship  that 


A x = λ x


In  addition,  in  certain  matrix  calculations  a  check  sum  or  several  check 
sums can be carried along with the calculation. In most methods for
inverting matrices, it is typical to compute a row sum at each stage of
the  pivotal  condensation  method.  The  row  sums  provide  a  check  for  the 
remaining  calculation. 
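
The row-sum check can be sketched as follows. For brevity the sketch uses ordinary forward elimination with partial pivoting as a stand-in for the pivotal condensation method, and it assumes a nonsingular matrix. An extra column holding the sum of each row is carried through every row operation; because the operations are linear, the extra column must still equal the row sum after each stage. The `fault` hook is a hypothetical device for injecting an error.

```python
def eliminate_with_row_sums(matrix, tol=1e-9, fault=None):
    """Forward elimination that carries and verifies a row-sum check column.

    Returns the reduced rows, or raises ArithmeticError when a row sum
    disagrees with the carried check value.
    """
    rows = [[float(v) for v in r] + [float(sum(r))] for r in matrix]
    n = len(rows)
    for stage in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(stage, n), key=lambda i: abs(rows[i][stage]))
        rows[stage], rows[pivot] = rows[pivot], rows[stage]
        for i in range(stage + 1, n):
            factor = rows[i][stage] / rows[stage][stage]
            rows[i] = [x - factor * p for x, p in zip(rows[i], rows[stage])]
        if fault is not None:
            fault(stage, rows)          # optional error injection
        for r in rows:
            if abs(sum(r[:-1]) - r[-1]) > tol * max(1.0, abs(r[-1])):
                raise ArithmeticError(f"row-sum check failed at stage {stage}")
    return [r[:-1] for r in rows]


def corrupt(stage, rows):
    """Hypothetical fault: disturb one element during the second stage."""
    if stage == 1:
        rows[2][2] += 0.5


m = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
eliminate_with_row_sums(m)              # passes silently
try:
    eliminate_with_row_sums(m, fault=corrupt)
except ArithmeticError as err:
    print("detected:", err)
```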


In summary, where a regularity in a mathematical sense exists in the
calculation, a simple inverse calculation can often be performed which
provides  some  capability  for  checking  the  original  calculation. 

In the case of array computers such as the ILLIAC 4 and the STARAN
computers,  algorithmic  checks  as  discussed  above  can  be  carried  out  by 
adding  extra  processors  that  perform  this  checking  at  all  times  in  the 
calculation.  This  would  result  in  a  certain  redundancy  of  equipment, 
but  would  speed  up  the  checking  process. 

It  must  be  pointed  out  that  the  techniques  discussed  above  are  not 
universally  applicable.  The  major  reasons  that  preclude  their  use  are 
lack  of  storage  to  retain  partial  results,  lack  of  bandwidth  to  place 
partial  results  on  a  back-up  memory,  and  the  lack  of  an  efficient 
inverse  calculation.  These  considerations  make  it  necessary  in  such 
cases  to  use  other  fault-tolerance  techniques. 

Reconfiguration  in  array  computers  is  complicated  by  the  fact  that  the 
communication  paths  between  each  processing  element  and  its  neighbors 
are  of  very  high  bandwidth  and  contain  a  large  number  of  lines. 
Therefore,  if  a  substitute  processor  is  to  be  inserted,  the  switching 
capability  has  to  be  very  large.  In  addition,  extra  gate  delays  that 
may  be  introduced  by  such  switching  capability  will  frequently  not  be 
tolerable.  In  the  case  of  these  machines,  manual  switching  of  a  new 
processing  element  into  a  mesh  of  such  elements  appears  the  most 
practical  form  for  rapid  reconfiguration  in  the  event  of  faults  in  a 
processing  element. 

Beyond  the  special  points  mentioned  above,  reliability  techniques  for 
array  processors  are  effectively  similar  to  those  for  any  other  type  of 
computer.  They  are  conditioned  by  the  fact  that,  since  the  processors 
are  so  large,  the  probability  of  failure  is  much  higher.  The  switching 
problem  in  reconfiguration  is  thus  complicated  by  the  large  number  of 
lines  of  high  bandwidth.  However,  these  disadvantages  are  of  less 
critical  importance,  because  such  machines  are  seldom  used  in  a 
real-time  on-line  mode,  and  a  few  minutes  downtime  for  manual 
reconfiguration is often acceptable.

Super-fast computers do present one problem not always apparent in other
systems, namely, the difficulty of checkpointing the total state of the
system so that the computer can be switched back to that state at some
future time for a calculation to be restarted. Checkpointing can be a
complex and time-consuming operation in a computer that is very large
and  in  which  many  operations  are  carried  out  simultaneously. 


6.5. AEROSPACE COMPUTERS

Aerospace  computer  systems  have  been  considered  in  great  detail 
elsewhere  (e.g.,  see  several  systems  in  Appendix  2).  They  are  treated 
here  partly  for  historical  reasons,  partly  because  experience  in  such 
systems  is  relevant  to  parts  of  more  general  systems,  and  partly  because 
their  redundancy  can  be  reduced  in  many  cases. 

REQUIREMENTS 

Aerospace computers differ from many other computers in several ways.
We discuss here the differences in the requirements for fault-tolerance.
For example, we can express one difference in terms of a requirement
that the probability of an incorrect result being generated should be
less than 1 in 100 million per hour of use. This is the relevant figure
for calculations critical to flight safety in a commercial aircraft
(Wensley et al. 73). It translates into an MTBF of 10,000 years, a very
stringent  requirement  upon  reliability. 
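
The translation from an hourly error probability to an MTBF can be checked with a few lines of arithmetic. The sketch assumes a constant failure rate, so that the MTBF is simply the reciprocal of the hourly probability; the function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def mtbf_years(error_probability_per_hour):
    """MTBF in years for a constant hourly failure probability."""
    return 1.0 / error_probability_per_hour / HOURS_PER_YEAR


# 1 in 100 million per hour of use, as quoted above.
print(mtbf_years(1e-8))
```

The result is a little over eleven thousand years, which the text rounds to 10,000.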

In  addition,  for  certain  calculations  such  as  stability  augmentation  or 
flutter  control,  the  recovery  time  must  be  exceedingly  low.  In  certain 
cases,  it  is  as  little  as  10  milliseconds. 


In  this  application  field  the  computer  is  a  very  small  proportion  of  the 
total  cost  of  a  system,  whether  a  commercial  jet  liner  or  a  space 
vehicle.  Thus,  a  level  of  redundancy  may  be  afforded  that  in  many  other 
applications is not economically practical.

It is typical in the aerospace field that highly repetitive calculations
must be performed, and that these are very loosely coupled. Such
calculations typically might compute the numerical solution of
differential equations that represent a mathematical analog of a control
servo. The iterative nature of the calculation can be taken advantage
of by carrying out the checks at the end of each iteration, rather than
at  the  end  of  every  small  operation  within  the  iteration. 

In  certain  aerospace  applications,  no  maintenance  is  available, 
particularly  in  the  long-life  space  missions  to  the  outer  planets.  The 
fault-tolerance procedures must be automatic because, in addition to the
lack of maintenance availability, there may be occasions when such a
space vehicle could be in a position where communication with the earth
was either not possible or of very low bandwidth. In addition, the life
expected  from  computers  in  such  missions  may  be  very  high,  from  five  to 
ten  years  being  entirely  possible.  Such  long  life  means  that  the 
probabilities  of  chip  failures  and  other  malfunctions  become  very  high, 
to  the  point  that  over  half  of  the  circuits  within  the  computer  may  have 
failed. 

TECHNIQUES 

The  most  obvious  technique  to  use  for  both  detection  and  correction  is 
extensive  replication,  usually  triplication.  Also,  the  application  is 
well  suited  to  a  multiprocessor  organization  that  can  handle  many 
independent  processes.  The  output  of  several  identical  processing 
elements  is  compared  and  voters  attempt  to  remove  the  effect  of  one  of 
the  processors  being  in  error.  Although  coding  is  also  used  to  assist  in 
error  detection  and  correction,  coding  alone  is  not  sufficient  to 
provide  adequate  reliability  for  some  of  the  most  critical  applications. 
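
Replication with voting at the end of each iteration of a repetitive task can be sketched as follows. The sketch is not a description of any of the systems cited; it assumes three replicas that produce bit-identical results when healthy (so that outputs can be compared exactly), a simple majority voter, and a count of disagreements kept for later diagnosis. All names are hypothetical.

```python
from collections import Counter


def vote(outputs):
    """Return the majority value and the indices of replicas that disagree."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among replicas")
    return value, [i for i, out in enumerate(outputs) if out != value]


def run_voted(step, state, replicas, iterations):
    """Run one iteration on every replica, then vote once per iteration."""
    suspects = Counter()
    for k in range(iterations):
        outputs = [replica(step, state, k) for replica in replicas]
        state, disagreeing = vote(outputs)   # the voted state feeds the next iteration
        suspects.update(disagreeing)
    return state, suspects


def step(s):
    """A stand-in for one iteration of a control calculation."""
    return 2 * s + 1


def healthy(step, state, k):
    return step(state)


def transient_fault(step, state, k):
    """Hypothetical replica that produces one wrong result, at iteration 2."""
    return step(state) + (1 if k == 2 else 0)


print(run_voted(step, 0, [healthy, healthy, transient_fault], 4))
```

Because every replica restarts each iteration from the voted state, the effect of a transient error in one replica does not propagate beyond the iteration in which it occurs.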

As  mentioned  above,  voting  may  be  carried  out  at  the  end  of  each 
iteration  of  a  repetitive  task,  as  in  the  SIFT  system  (Wensley  A2,  72), 
or  it  may  be  carried  out  upon  each  transfer  of  data  between  processor 
and  memory,  as  in  the  Hopkins  system  (Hopkins  A2)  or  in  the  Bl'CS  system 
(Wensley et al. 73).

The reconfiguration of a system is typically accomplished by switching
out faulty processors and switching in some spare processors or memory
modules. In the case of the Hopkins scheme, a multiplicity of units is
switched in and out whenever a fault is detected. Two processors and
three scratchpad memories are all discarded and the calculation is
transferred to another module of the same size.

While  the  degree  of  redundancy  that  is  acceptable  and  needed  in 
aerospace applications is very seldom appropriate to large ground-based
systems,  the  techniques  may  be  very  usefully  applied  to  certain  small 
critical  subsystems  within  a  large  system. 


6.6.  CONCLUSIONS 


The  main  conclusions  to  be  drawn  from  our  study  of  applications  and 
architectures  for  fault  tolerance  are: 

(a)  Many  existing  computer  designs  already  incorporate  some 
fault-tolerance  techniques  which  in  some  application  fields  provide 
adequate  availability  and  guarantees  of  correctness.  Prime  examples  are 
those systems used in financial institutions (banks, stock exchanges,
etc.) and commercially operated service bureaus, with both batch and
time-shared  modes  of  operation. 

(b)  Computers  that  are  built  using  the  newer  technologies  (e.g.,  LSI) 
are intrinsically more reliable, primarily because of the reduced number
of  components  and  the  attendant  reduction  in  the  number  of  such  items  as 
connectors  and  cables. 

(c)  Techniques  exist  to  provide  adequate  fault-tolerance  for  all 
application  fields.  In  most  cases,  these  techniques  are  economical, 
especially  when  compared  to  total  system  costs. 

(d) Different techniques are sometimes necessary for improvement of
different fault-tolerance parameters, e.g., correctness, availability,
recovery. The proper specification of fault-tolerance must recognize
these  different  parameters. 

(e)  The  use  of  selective  redundancy  can  be  an  effective  technique  to 
provide  greater  fault-tolerance  for  critical  system  functions  and 
smaller  redundancy  for  non-critical  programs. 


CHAPTER  7.  CONCLUSIONS  AND  RECOMMENDATIONS 


7.1.  CONCLUSIONS 


This  section  summarizes  the  main  conclusions  of  the  report. 

GENERAL  CONCLUSIONS  (See  Chapters  1,  3  and  6) 

*  Techniques  exist  for  achieving  economical  fault  tolerance  for  many 
important  applications,  without  needing  massive  redundancy.  Significant 
levels  of  correctness  and  system  availability  can  be  achieved  with 
redundancy  from  10  to  40  percent. 

*  Techniques  exist  to  provide  a  much  higher  degree  of  graceful 
degradation  than  is  currently  available. 

*  A  significant  problem  in  existing  systems  is  the  unpredictable  and 
unnecessarily  long  time  required  to  recover  after  the  occurrence  of  some 
faults.  This  problem  is  made  worse  in  most  existing  systems  by  poor 
architectural  structures  and  inadequate  diagnostic  techniques. 

*  The  degree  of  fault  tolerance  required  and  the  choice  of  techniques 
needed  to  achieve  it  are  both  strongly  dependent  on  the  environment. 

*  Software  and  operational  considerations  must  be  carefully  integrated 
with  the  hardware  in  the  design  of  a  fault-tolerant  system.  The  present 
art  of  computer  system  design  is  capable  of  such  integration,  if 
properly motivated by management directives.

The following discussion concerns some of the specific techniques for
fault  tolerance.  Some  of  these  are  readily  available,  while  others  are 
capable  of  being  developed. 

ARCHITECTURAL CONSIDERATIONS (See Sections 3.2, 3.3, Chapter 6)

* Simplex systems are adequate in some cases.

* Reconfigurable multiprocessors are desirable for high availability and
graceful  degradation. 

*  Good  system  structuring  is  highly  beneficial  throughout  system 
development. 

*  System  security  is  strongly  related  to  fault  tolerance.  Protection 
mechanisms  are  critical  to  some  uses  of  multiprocessor  architectures. 

PROCESSOR  CONSIDERATIONS  (See  Chapters  3,  5  and  6) 

* In most systems, dynamically selective replication of critical
processing capability may be used without greatly affecting the overall
cost.

* Deferred detection, interspersed on-line diagnostics, and automatic
recovery strategies are useful in reducing redundancy when time is not
critical.

*  Error  detection  (or  correction)  in  arithmetic  can  be  achieved  with 
codes  also  achieving  error  detection  (or  correction)  in  memory  (see 
below), at almost the same cost as the best codes for memory alone.
Byte coding is suitable for LSI arithmetic.

*  For  certain  processing  functions,  increased  dependence  on  memory 
(e.g.,  by  table  driving)  is  very  effective,  since  it  allows  economical 
use of redundancy. Distributed logic-in-memory designs are interesting
in  certain  cases. 

*  The  use  of  read-only  memories  with  coding  can  be  highly  effective  for 
reliable  logic. 

MEMORY  CONSIDERATIONS  (See  Section  3.1  and  Chapter  4) 

* Fault tolerance is more economical in memory units than in other parts
of a computer system. Performing functions in memory that are normally
done  in  logic  (e.g.,  via  table-driving)  permits  economical  fault 
tolerance. 

*  Coding,  byte  slicing,  page  relocation,  and  memory  reconfiguration  are 
appropriate  for  fault-tolerant  memories. 

*  Byte  slicing  and  byte  coding  are  particularly  appropriate  for  LSI 
memories  that  store  several  bit  positions  on  a  chip.  Byte  coding 
requires  only  one  redundant  byte  per  word  for  detection  of  arbitrary 
errors  within  any  byte  of  the  word,  and  a  logarithmically  increasing 
cost for byte error correction. The increase in the overall cost due to
encoding  and  decoding  is  negligible  (except  for  very  small  memories). 

*  No  delay  is  required  for  decoding  in  the  absence  of  errors  whenever 
error detection (syndrome generation) can be overlapped with execution
in  an  automatic  instruction  retry  environment. 

*  Reconfiguration  around  faulty  memory  components  is  simple  and  highly 
effective. Reconfiguration at the block level is aided by page
relocation  in  hardware.  A  virtual  memory  organization  in  hardware  can 
offer  further  benefits  for  fault  tolerance.  For  certain 
high-availability  and  high-reliability  requirements,  replacement  by 
switching  at  the  chip  level  is  appropriate  in  combination  with  byte 
coding. 
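
The statement above that one redundant byte per word suffices to detect arbitrary errors within any one byte can be illustrated with the simplest such code, in which the check byte is the bitwise exclusive-OR of the data bytes. This is our choice of example and not necessarily the code intended in Chapter 4; the names are hypothetical. Any error pattern confined to a single byte, as when one memory chip holding that byte position fails, changes the syndrome by exactly that pattern and so cannot leave it zero.

```python
from functools import reduce


def check_byte(data: bytes) -> int:
    """Bitwise exclusive-OR of all bytes."""
    return reduce(lambda a, b: a ^ b, data, 0)


def store(word: bytes) -> bytes:
    """Append the single redundant byte to the data word."""
    return word + bytes([check_byte(word)])


def syndrome(stored: bytes) -> int:
    """Zero when no error is detected; otherwise the error pattern."""
    return check_byte(stored)


stored = store(bytes([0x12, 0x34, 0x56, 0x78]))
print(syndrome(stored))                      # clean word

damaged = bytearray(stored)
damaged[2] ^= 0b10110101                     # several bits wrong in one byte
print(syndrome(bytes(damaged)))              # non-zero: error detected
```

Locating and correcting the faulty byte requires additional check bytes, which is the source of the further cost mentioned above.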

TECHNOLOGICAL CONSIDERATIONS (See Chapters 3, 4, 5, 6)

*  Newer  technologies  permit  certain  techniques  for  fault  tolerance  to  be 
practical. However, they do not supplant the need for architectural
fault  tolerance. 


*  LSI  outmodes  many  of  the  techniques  for  handling  single  faults  and 
single-bit errors. Correlated faults must be considered.


7.2. RECOMMENDATIONS FOR FUTURE RESEARCH AND DEVELOPMENT


Throughout  this  report  are  conclusions  with  implications  for  future 
research  and  development.  Our  recommendations  for  future  research  and 
development  are  summarized  here,  and  are  classified  according  to 
detection  and  diagnosis,  architecture,  and  analysis. 

DETECTION  AND  DIAGNOSIS 

Most  existing  fault-tolerant  systems  use  primitive  techniques  for  error 
detection  (e.g.,  replication  of  processors,  coding  within  memory).  We 
remain  convinced  that  more  economical  methods  exist,  such  as  using 
probabilistic  and  deferred  error  detection,  which,  for  example,  take 
advantage  of  knowledge  about  existing  permanent  faults.  Feedback  error 
detection  is  also  possible.  Models  are  needed  that  permit  a  theoretical 
study  of  the  time-space  trade-offs  in  fault-tolerant  systems. 

Programmed  consistency  checks  are  a  powerful  error-detection  technique 
for certain types of computations, notably those involving servo-type
control or those with a readily computed inverse. We believe that a
much broader class of programs is suited to such checks. The use of
run-time assertions (e.g., similar in nature to, but not as complete as,
Floyd  assertions)  appears  to  be  very  promising. 
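
As a small illustration of what such run-time assertions might look like, the sketch below guards an integer square-root routine with a precondition, a loop invariant, and a postcondition. The routine and its name are our own example. The assertions are evaluated while the program runs, so a hardware fault that corrupts an intermediate value is likely to violate one of them; they do not constitute a proof of the program.

```python
def isqrt_checked(n: int) -> int:
    """Integer square root by bisection, guarded by run-time assertions."""
    assert n >= 0, "precondition: n must be non-negative"
    lo, hi = 0, n + 1
    while hi - lo > 1:
        # Invariant: the answer always lies in the half-open range [lo, hi).
        assert lo * lo <= n < hi * hi, "loop invariant violated"
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    assert lo * lo <= n < (lo + 1) ** 2, "postcondition violated"
    return lo


print([isqrt_checked(n) for n in (0, 1, 15, 16, 17)])
```

In Python, assertions are skipped when the interpreter is run with the -O option, so a check that must always be active would be written as an explicit test instead.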

Periodic  self-diagnosis  is  important  as  a  means  for  fault  detection  and 
also  as  a  means  for  reducing  needs  for  preventive  maintenance  and 
eliminating  the  need  for  emergency  maintenance.  Good  algorithms  now 
exist  for  specifying  test  sequences  for  combinational  networks  when  the 
faults are simple, e.g., gate outputs being stuck at 0 or 1, but not for
more  realistic  faults.  The  sequential  case  is  not  at  all  well 
understood.  Very  little  has  been  done  on  the  problem  of  general  methods 
for  diagnosing  large  systems  so  as  to  pinpoint  a  faulty  module.  We  feel 
that  these  problems  are  all  soluble  if  specific  structures  (say, 
distributed  two-dimensional  networks)  are  considered,  or  if  redundancy 
is  permitted  within  the  logic  to  enhance  diagnosability. 
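
The simple case mentioned above (gate outputs stuck at 0 or 1 in a
combinational network) can be sketched as follows. This is a minimal
illustration, assuming a small hypothetical three-gate network and using
exhaustive simulation with a greedy covering step; it is not one of the
published test-generation algorithms, and exhaustive enumeration is
practical only for very small networks.

```python
from itertools import product

# Hypothetical example network: y = (a AND b) OR (NOT c)
INPUTS = ["a", "b", "c"]
GATES = [  # (output name, operation, input names), in evaluation order
    ("g1", "AND", ("a", "b")),
    ("g2", "NOT", ("c",)),
    ("y", "OR", ("g1", "g2")),
]
OUTPUT = "y"

OPS = {
    "AND": lambda v: int(all(v)),
    "OR": lambda v: int(any(v)),
    "NOT": lambda v: 1 - v[0],
}


def simulate(vector, fault=None):
    """Evaluate the network; fault is (gate output name, stuck value) or None."""
    values = dict(zip(INPUTS, vector))
    for name, op, ins in GATES:
        values[name] = OPS[op]([values[i] for i in ins])
        if fault is not None and fault[0] == name:
            values[name] = fault[1]  # force the stuck-at value
    return values[OUTPUT]


def greedy_test_set():
    """Pick input vectors until every detectable stuck-at fault is covered."""
    vectors = list(product((0, 1), repeat=len(INPUTS)))
    faults = [(name, s) for name, _, _ in GATES for s in (0, 1)]
    # A vector detects a fault if the faulty output differs from the good one.
    detects = {
        v: {f for f in faults if simulate(v, f) != simulate(v)} for v in vectors
    }
    remaining = set().union(*detects.values())
    tests = []
    while remaining:
        best = max(vectors, key=lambda v: len(detects[v] & remaining))
        tests.append(best)
        remaining -= detects[best]
    return tests


if __name__ == "__main__":
    print(greedy_test_set())
```

The sketch covers only single stuck-at faults on gate outputs. The more
realistic faults and the sequential case discussed above are exactly
where such a simple approach stops working.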

ARCHITECTURE

A serious weakness in the current art is the absence of a design
methodology  that  integrates  hardware  and  software  into  a  systems  concept 
addressing  reliability,  availability,  security,  efficiency,  and 
functional  capability  in  a  unified  way.  For  example,  significant 
benefits  can  be  expected  from  techniques  for  structured  design  and 
implementation  (see  Section  3.2).  Such  a  methodology  requires 
significant  communication  and  cooperation  among  research  and  development 
people,  among  hardware  and  software  people,  and  among  university  and 
industrial  people.  (The  ARPA  Network  is  providing  some  steps  in  this 
direction.) 

There is a need to develop economical architectures for fault tolerance
in  a  general-purpose  environment.  (The  aerospace  and  telecommunications 
applications  and  specialized  minicomputers  have  received  most  of  the 
attention  to  date.)  In  particular,  the  multiprocessor  outlined  in 
Section  3.3.3  is  an  attractive  possibility,  with  selective  replication 
in  time  and  space.  An  operating  system  for  this  architecture  is  also 
worth  investigation.  There  is  also  a  need  for  an  economical  solution  to 
the protection problem in a large dependent-processor multiprocessor
system.

Possibilities  for  fault  tolerance  should  also  be  exploited  via  novel 
architectures,  including  highly  reconfigurable  distributed 
micro-processor  arrays  and  networks  of  larger  computers.  An  important 
direction  for  future  systems  is  the  achievement  of  smoothly  degradable 
economical  systems  with  rapid  recovery  from  faults.  The  scheme  for 
reconfigurable  memory  arrays  of  Section  4.2  represents  a  possible 
starting point for such systems.

ANALYSIS 

There  remains  a  difficult  problem  of  analyzing  the  reliability  of  a 
redundant  system  or  even  proving  that  it  is,  say,  single-fault  tolerant. 
The difficulty is greatly reduced by structured design and by proofs
that the executive can reconfigure the system as intended. This issue
is no different from proving the correctness of an operating system, a
process considerably simplified by structured design. However, the
modeling of complex fault-tolerant systems is also important here; it is
an issue frequently studied, but still not adequately resolved.

An  important  quantitative  measure  of  a  fault-tolerant  system  is  the 
relative  cost  of  fault  tolerance,  e.g.,  the  redundancy.  Except  when 
trivial  techniques  are  used,  it  is  difficult  to  estimate  the  redundancy 
accurately.  In  this  report  we  associate  the  various  redundancy 
techniques  with  different  types  of  architecture.  More  generally,  it 
would  be  useful  to  have  measures  of  the  total  redundancy,  e.g.,  as  a 
function  of  availability,  reliability,  and  down-time. 

In  summary,  the  state  of  the  art  leads  to  considerable  hope  for  the 
development  of  economical  fault-tolerant  systems.  However,  there  is 
still  much  need  —  and  fortunately,  much  room  —  for  advancement  in  the 
state  of  the  art. 

APPENDIX 1

CENSUS  OF  FAULT-TOLERANT  COMPUTING  SYSTEMS 

This  is  a  brief  summary  of  systems  and  system  designs  providing 
significant  fault-tolerance  and/or  availability.  Those  systems 
indicated  by  "(A2)"  are  considered  in  greater  detail  in  the  Survey  of 
Fault  Tolerant  Computing  Systems  (Appendix  2),  where  references  are 
included.  Terse  references  are  given  here.  Several  systems  are 
described in what is referred to here as the "Intermetrics Report" (J.
S. Miller et al., Multiprocessor Computer Study, Final Report, Contract
NAS 9-9763, Intermetrics, Inc., Cambridge, Mass., March 1970).
Abbreviations: P=Processor, M=Memory, (S)EC=(single) error correction,
(D)ED=(double) error detection. A measure of the hardware overhead for
fault tolerance is given as the percentage of all hardware dedicated to
fault tolerance (on an approximate cost basis).
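
To make the overhead measure concrete, the following Python sketch
computes it for the idealized case of pure n-fold replication. The
function name and the support_fraction parameter (an assumed share of
total hardware spent on voters, switches, and the like) are hypothetical,
and the numbers it yields are illustrative rather than measurements of
any system listed below.

```python
def replication_overhead_percent(copies, support_fraction=0.0):
    """Percent of all hardware dedicated to fault tolerance.

    copies: number of identical units, one of which would suffice
        in the absence of faults.
    support_fraction: assumed share of total hardware spent on voters,
        switches, and other support logic (0.0 means none).
    """
    if copies < 1 or not 0.0 <= support_fraction < 1.0:
        raise ValueError("invalid parameters")
    # Share of the hardware that does non-redundant work.
    useful = (1.0 - support_fraction) / copies
    return 100.0 * (1.0 - useful)


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, round(replication_overhead_percent(n), 1))
```

Under these assumptions duplication gives 50%, triplication about 67%,
and fourfold replication 75%, which is the same range as the figures
quoted for the heavily replicated aerospace designs in Section C.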

A.  GENERAL-PURPOSE  COMPUTING  UTILITIES,  generally  good  availability, 
human  users,  modest  reliability,  maintenance  permitted. 

1(A2). Multics, MIT (F. J. Corbato) and Honeywell, Cambridge, Mass.;
ARPA-funded development, now Honeywell product. See E. I. Organick, The
Multics System, MIT Press 1972.

*  General-purpose  computing  utility  (time-sharing,  batch),  with  high 
availability  and  file  integrity.  Four  installations  currently  exist. 

*  1-7  P  (Honeywell  6180s),  typically  2P,  multiprocessed  multiprogramming 
totally  reentrant  procedure,  virtual  memory,  manual  reconfiguration  of 
multiple  P  and  M  during  operation,  extensive  isolation  via  the  ring 
mechanism  for  protection  and  via  file  system  access  control,  incremental 
file  backup,  variable-depth  system  recovery,  redundancy  in  the  file 
directory structure. SED in main memory. Significant security.

Hardware  negligibly  redundant.  Software  variably  redundant,  e.g.,  20% 
overhead  in  time  for  guaranteed  30-minute  lag  backup. 

2(A2). PRIME, University of California at Berkeley (H. Baskin); ARPA.

*  Reliable,  secure,  modest  computer  utility,  high  availability.  In 
development. 

*  5  P  (design  practical  for  3  P  to  8  P),  with  highly  restricted  possible 
connectivity among M, P and disk, strict isolation with no memory
sharing  or  multiprogramming,  "spontaneous"  reconfiguration  via  a 
reliable  self-checking  switch.  Hardware  less  than  10%  redundant, 
software  less  than  10%  redundant  in  time. 

3(A2).  Carnegie-Mellon  University;  ARPA. 

*  Research  system  development  with  applications  to  ARPA  speech 
understanding  project;  in  design.  2x2  version  exists. 

* 16 P x 16 M (modified PDP 11s), with reliable crossbar switch. Hard
and  soft  reconfigurability,  with  widely  varying  operating  modes. 

Hardware  less  than  5%  redundant. 

4.  University  of  Newcastle-on-Tyne,  Engl.;  Scientific  Research  Council. 

*  General  computing;  in  design 

* PDP 11s

Note.  Burroughs  B7700  and  IBM  System/370  have  significant  hardware 
facilities  for  fault  tolerance.  Also,  various  commercial  time  sharing 
services gain availability (but not necessarily reliability) by having
multiple  P,  M  and  secondary  memory  units  cross-switchable. 


B.  GROUND-BASED  SPECIAL  PURPOSE  SYSTEMS,  controlling  the  environment  (or 
vice  versa),  generally  higher  reliability  and  availability,  often 
tighter  real-time  constraints  than  those  above,  usually  maintainable. 

5(A2). ESS (Electronic Switching Systems), Bell Labs, Naperville, Ill.

* Telephone switching system; long-term continuous system availability
with occasional errors supposedly tolerable to customers. Over 200
Number 1 ESS in operation, many more Number 2 ESS, TSPS.

* 2 P (1 functional, 1 standby checking and diagnosis), automatic
reconfiguration. Separate nonalterable program store with SEC. 50% of
all programs are diagnostics. Millions of hours of experience have
aided in improving hardware and software reliability. People problems
still difficult (operations, maintenance). About 50% redundant in
hardware. Storage for software due to fault tolerance also significant
(half of all programs).

6(A2).  PLESSEY  System  250,  The  Plessey  Co.,  Ltd.,  Taplow  England. 

*  Telephone  and  data  switching,  long-term  continuous  availability, 
modular  expandability.  Prototype  end  of  1971. 

*  1-16  P,  1-30  M,  each  16-64K.  Multiprocessing,  multiprogramming, 
virtual memory, totally reentrant, capability-based protection and
sharing.  Continued  operation  via  reconfigurability  with  everything 
multiply  available.  Extensive  hardware  fault  detection,  operating 
system  consistency  checks,  background  test  routines.  Hierarchical 
software recovery. Hardware 20-50% redundant, depending on use.

7(A2).  High-speed  modular  interface  message  processor  (IMP)  for  the 
ARPANET,  Bolt  Beranek  &  Newman,  Cambridge,  Mass. 

* Store and forward for interhost message switching. High
availability.  Reliability  largely  left  to  hosts. 

* 1-14 P initially, each with 4K M. Smoothly degradable, e.g., in 2 P
units.  Distributed  power,  cooling. 

8(A2). CLC, Bell Labs, Whippany NJ; ABMDA (Safeguard)

*  Safeguard  missile  defense;  continuous  availability  when  (and  if) 
required.  In  development  since  mid-60s. 

* Up to 10 P, multiprocessed, on-line sparing, separate program memory
not  writeable;  program  retry;  ED  via  four-bit  check  on  64-bit  words. 

9.  FAA  (Federal  Aviation  Adm.),  IBM.  See  IBM  Sys  J.,  vol  6,  no  2,  1967. 

*  Air  traffic  control,  long-term  continuous  availability.  Untolerated 
nontransient  errors  can  be  disastrous.  About  20  systems  at  ATC  centers 
covering  the  continental  United  States. 

*  Up  to  4  P  (IBM  9020),  up  to  12  M.  Program-controlled  error  analysis 
and reconfiguration, gracefully deconfigurable. 5-second battery backup
power  supply.  Relies  heavily  on  good  available  field  engineers. 

10. Flight Plan Processing System, Marconi Radar Systems Ltd.,
Chelmsford,  England. 

*  Real-time  air-traffic  control.  At  most  one  30-sec  interrupt  per  year, 
at most one longer interruption in 5 years, immune to power failures,
fast  repair  of  faulty  equipment. 

*  3  P  (MYRIAD) 

11.  MDS-2  (Market  Data  System),  New  York  Stock  Exchange 

* Stock trading ticker control. Near-continuous availability, no
transaction losses permitted. Operational August 1972. Precursor MDS-1
operational for 7 years.

* 3 P (360/50), 2 multiprocessing with shared M & LCS (but 1 P basically
monitoring), 3rd P normally spare (running background jobs), extensive
program checking. Highly replicated peripherals (I/O, disks, etc.).

12(A2). COMEX, Pacific Coast Stock Exchange

* Stock trading message switching; near-continuous availability, no
transaction losses permitted, small real-time lag permitted.
Operational since 1969.

* 2 complete systems (each has 360/50 plus 2 PDP 8s), one in San
Francisco, one in Los Angeles, capable of running separately or
cross-switched (interconfigurable).

13. NASDAQ, National Association of Securities Dealers Automated
Quotations;  See  Datamation,  March  1972,  pp.  42-45. 

*  On-line  interactive  system  to  facilitate  trading  of  OTC  securities; 
high  availability;  operational  since  end  of  1971. 

* 2 P (1108s), multiprocessing under EXEC 8, capable of running simplex.
Dual  records  in  file  structure,  automatic  recovery  techniques. 

14.  Standard  Telecommunications  Lab,  Harlow,  England.  See  Electrical 
Review,  6  Feb  1970,  pp.  1-3. 

*  Real-time  control 

* 1 P, SEC/DED in M, in transfers, and in I/O; duplication of
punch/reader and of M access switches; triplication of control and of
function unit. 52% of hardware due to fault tolerance.


15.  Foxboro  88,  Foxboro  Corp.  Process  control  using  2  P  (PDP  8) 


C.  AERO-SPACE  SYSTEMS,  usually  with  ultra-high  reliability  and 
availability requirements, usually critical real-time constraints, human
maintenance  usually  not  possible.  At  least  the  first  four  efforts  have 
resulted  in  prototype  systems.  The  remaining  efforts  represent  mostly 
designs  in  various  stages  of  completion. 


16(A2). JPL-STAR, JPL, Pasadena Cal (A. Avizienis); NASA

*  Unmanned  outer-space  travel  computer,  long-life  availability  without 
maintenance.  Prototype  in  operation  since  1969. 

* 1 P (uniprocessing), heavy use of coding (residue checking for SED in
memory and arithmetic, ED in op codes), duplicated logic operations,
triplicated monitoring and control (TARP = test and repair processor),
replacement  by  spares  via  power  switching.  User-provided  rollback 
points.  60%  of  hardware  due  to  fault  tolerance. 


17(A2).  MECRA,  Electronique  Marcel  Dassault,  St.  Cloud,  France;  DRME 

*  General-purpose  design  for  special-purpose  applications,  including 
aerospace.  Prototype  now  working. 

* Duplex arithmetic, Hamming code (7,4) as DED on coded decimal
representations  (with  six  unused  combinations),  sparing, 
microprogrammable  reconfiguration.  About  66%  redundant. 
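
The use of a Hamming (7,4) code purely for detection can be sketched as
follows. The bit layout below is the usual textbook arrangement and is
an assumption, since MECRA's actual layout is not given here; the
function names are likewise hypothetical. Because the code has minimum
distance 3, any single or double bit error yields a nonzero syndrome,
and the six 4-bit combinations not used for decimal digits provide a
further check.

```python
def encode(digit):
    """Encode a decimal digit (0-9) as a 7-bit Hamming (7,4) word."""
    if not 0 <= digit <= 9:
        raise ValueError("not a decimal digit")
    d = [(digit >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p3 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    # Positions 1..7: p1 p2 d0 p3 d1 d2 d3
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]


def syndrome(word):
    """XOR of the (1-based) positions of all set bits; 0 for a codeword."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def check_digit(word):
    """Return the decimal digit, or raise ValueError on a detected error."""
    if syndrome(word) != 0:
        raise ValueError("parity error detected")
    digit = word[2] | (word[4] << 1) | (word[5] << 2) | (word[6] << 3)
    if digit > 9:
        raise ValueError("unused code combination")
    return digit


if __name__ == "__main__":
    w = encode(7)
    print(check_digit(w))
    w[0] ^= 1
    w[5] ^= 1  # inject a double error
    try:
        check_digit(w)
    except ValueError as err:
        print("detected:", err)
```

Using the code for detection only, rather than correcting single
errors, is what allows double errors to be caught; a correcting decoder
would miscorrect them.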


18(A2). ACGN, CERBERUS, etc., MIT Draper Lab, Cambridge, Mass (A.L.
Hopkins,  Jr.);  NASA/MSC. 

*  Apollo  manned  space  on-board  control,  very  high  reliability  during  the 
mission  without  maintenance.  Prototype  exists. 

* At least 1 processing unit (up to 6), multiprocessing among processing
units, replication within each processing unit and within memories
(without coding). Two concepts:

(a)  duplexed  processing  units,  triplexed  scratchpad  memories,  triplexed 
memories  and  buses,  with  spares; 

(b)  triplexed  processing-scratchpad  units. 

About  80%  redundant. 

19(A2). MDC (Modular digital computer), IBM Yorktown Hts, NY;
NASA-Huntsville.

*  Modular  system,  wide  range  of  high-reliability  applications;  design 
only 

* m P, multiprocessing and replication as well. FO-FO-FS
(fail-operational on first and second faults, fail-safe on the third) in
4 P fault-tolerant mode, detection mode also possible. Microdiagnostics,
b-adjacent multiple errors handled in M, extensive self-checking.

20(A2).  MSC  (Modular  spacecraft  computer),  Ultrasystems  (Newport  Beach 
Ca) and Raytheon (Waltham Ma); SAMSO/SYT (Los Angeles Ca)

* Reconfigurable guidance and control, space shuttle use; long-life
reliability. 

*  The  Raytheon  entry  in  this  effort  has  1  P,  identical  subP  and  subM 
reliably  switchable  with  sparing.  SEC  in  M  plus  3  spare  bits  reliably 
switchable  via  "rippler",  burst-error  detection  in  mass  M,  triplicated 
control,  duplicated  configuration  control. 

* The Ultrasystems entry is similar to the JPL STAR.

21(A2). SIFT, Software implemented fault tolerance, SRI (John Wensley);
NASA-Langley 

*  Airborne  control  (commercial  aviation);  availability  of  correct 
results  during  flight;  some  tasks  more  critical  than  others,  permitting 
slight degradation of less critical tasks. Design only (see 1972 FJCC).

*  Multiprocessing  with  variable  software  replication,  dependent  on 
application program (software reconfigurable). Fault tolerance via
software  can  avoid  special  hardware,  permits  use  of  existing  designs. 
Connectivity  is  restricted:  P  can  modify  only  its  own  M,  can  read 
others, limits fault propagation. Executive uses the same
fault-tolerance procedures as application programs. About 75% redundant.
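
The idea of variable software replication with voting can be sketched as
follows. This is a minimal illustration under stated assumptions, not
the SIFT executive: processors are modeled as plain Python callables,
each result slot is written only by its own processor and read by the
voter, and all names (vote, run_replicated, healthy, faulty) are
hypothetical.

```python
from collections import Counter


def vote(copies):
    """Majority vote; returns (value, indices of dissenting copies)."""
    value, count = Counter(copies).most_common(1)[0]
    if 2 * count <= len(copies):
        raise RuntimeError("no majority among replicated results")
    dissent = [i for i, c in enumerate(copies) if c != value]
    return value, dissent


def run_replicated(task, arg, processors):
    """Run one task on each listed processor and vote on the results.

    The replication level is simply len(processors), so a critical task
    can be given three processors and a less critical one a single one.
    """
    copies = [p(task, arg) for p in processors]  # slot i written only by P i
    return vote(copies)


def healthy(task, arg):
    """Model of a fault-free processor."""
    return task(arg)


def faulty(task, arg):
    """Model of a processor that corrupts its result."""
    return task(arg) + 1


if __name__ == "__main__":
    square = lambda x: x * x
    print(run_replicated(square, 7, [healthy, faulty, healthy]))
    print(run_replicated(square, 7, [healthy]))  # non-critical: no replication
```

The list of dissenting copies is what a reconfiguration step would use
to decide which processor to stop trusting.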

22(A2). ARMMS, Hughes, Fullerton CA (W. L. Martin); NASA-Marshall (MSFC)

*  Spaceborne  control;  long-life  reliability 

*  m  P,  dynamically  reconfigurable,  e.g.,  as  independent-process 
multiprocessing  or  as  replication  with  sparing.  20%-80%  redundant 
(variable) 

23(A2). Intermetrics multiprocessor, Cambridge Mass (J. S. Miller),
outgrowth  of  EXAM;  NASA-ERC  (Houston) 

*  Manned  orbiting  space  station 

*  m  P  (1  to  8,  nominally  3),  each  P  internally  duplicated,  coding  in  M 
(ED),  capability  for  dynamic  duplication  of  critical  data  words, 
buffered instruction retry, save within interrupted instruction.

24(A2). Autonetics (N. Am. Rockwell, Anaheim, L. J. Koczela); NASA-MSC

*  Space  shuttle;  long-life  reliability 

* 4-level redundancy FO-FO-FS (cf. MDC). 80% redundant; less for lower
fault tolerance.

25.  SIRU  (Strapped-down  inertial  reference  unit),  MIT  Draper  Lab  (A.  L. 
Hopkins,  Jr.).  See  Intermetrics  Report  (reference  above). 

*  Apollo  guidance.  Simple  prototype  built. 

*  2  P  (1  as  standby),  M  duplicated. 

26. MULTIPAC, General Telephone and Electr., Waltham, Mass; NASA-Ames.
See  IEEE  Trans.  Aerospace  and  Electronic  Sys.,  Sept.  1971,  pp.  974-981. 

*  Data  handling  for  deep-space  probes.  Long  life,  but  arbitrary  outages 
can  usually  be  tolerated.  Design  only. 

* Up to 5 P, 15 M (4 K each), gracefully degradable to 1 P, 1 M.
Manual reconfiguration of software and hardware via ground-based
diagnosis, reprogramming, reassembly and transmittal of a new system
into space. Maintainable despite wide range of problems.

27. BUCS (bus checker system), SRI (Karl Levitt), NASA-Langley.
See SRI Final Report, NAS1-10920, 1973.

*  Aircraft  control,  as  in  SIFT 

*  5-10  (local)  P  &  M  units,  each  duplicated  internally,  byte  coding  in 
central  M,  bus  checker  coordinates  restart  mechanism,  periodic  diagnoses 
of M and of unduplexed processor functions. About 33% redundant.

28.  TOPS,  JPL  (Gilley).  See  IEEE  Trans.  Astr-Aero,  Sept.  1970. 

*  Thermo-electric  outerplanet  space  travel 

*  Related  to  JPL-STAR. 

29. MFC, Hamilton-Standard; NASA-ERC. See Intermetrics Report.

*  Modular  flight  computer 

* 3 P, 3 M, cross-configurable, TMR or 3 P multiprocessor

30.  ALPHA,  CDC.  See  Intermetrics  Report. 

31. AADC, Honeywell; NASA, AADC Naval Air Systems Command. See
Intermetrics  Report. 

32. IRAD, Litton. See Intermetrics Report.

33. SDC-Burroughs; USAF-Wright-Patterson, Multiprocessor

34.  S-3,  Univac 

35.  SUMC,  RCA  Advanced  Technology  Lab,  Camden  NJ;  NASA  Huntsville. 

* Space ultra-reliable modular computer, COSMOS technology.

APPENDIX 2

SURVEY  OF  FAULT-TOLERANT  COMPUTING  SYSTEMS 


This appendix presents replies to a questionnaire sent to architects of
various fault-tolerant computing systems. It is hoped that the
questionnaire will itself be useful as a descriptive form and that the
replies will aid in understanding and comparing the systems included
here. To this end the questionnaire has been designed to permit a
concise description of each system, its goals, its motivations, its
principles, its structure, its techniques, and its achievements to date.

The replies given here are included essentially in their entirety.
Several significant efforts are unfortunately not represented here,
e.g., IBM's FAA system, the New York Stock Exchange System MDS-2, and a
system under development at the University of Newcastle-on-Tyne.

The first issue of this survey was distributed informally to conference
participants at the Second International Symposium on Fault-Tolerant
Computing, Boston, June 19-21, 1972. It supported the panel discussion
"Approaches to the Architecture of Fault-Tolerant Computing", chaired by
Jack Goldberg.


The  contents  of  this  appendix  are  as  follows. 


Questionnaire                                        page A2.2

Replies:                                             A2.pp:   System:

A. Avizienis, JPL and UCLA                           4-6      JPL-STAR
B. R. Borgerson, U. C. Berkeley                      8-10     PRIME
W. C. Carter, IBM, Yorktown Heights, NY              12-13    MDC
J. L. Delamare, EMD, St.-Cloud, France               6-7      MECRA
Capt. L. A. Fry, SAMSO, Los Angeles, CA              10       MSC
A. L. Hopkins, Jr., MIT Draper Lab                   14-15    ACGN, etc.
L. J. Koczela, North American Rockwell               3        (3FT)
W. L. Martin, Hughes Aircr., Fullerton CA            16-17    ARMMS
J. S. Miller, Intermetrics, Cambridge MA             18-19    (mp)
S. M. Ornstein, Bolt Beranek & Newman                11       HSM IMP
W. C. Ridgway III, Bell Labs, Madison NJ             20-21    Safeguard
J. H. Saltzer, MIT Project MAC                       22-23    Multics
D. Siewiorek, Carnegie-Mellon Univ.                  26-27    C.mmp
W. Ulrich, Bell Labs, Naperville, Ill.               23-25    No. 1 ESS
D. C. Wallace, SRI for PC Stock Exchange             28-29    COMEX
J. H. Wensley, SRI                                   30-31    SIFT
R. K. Williams, Plessey, England                     32-34    System 250

SURVEY OF FAULT-TOLERANT COMPUTING SYSTEMS - QUESTIONNAIRE
SRI Computer Science Group, June 1972

1. IDENTIFICATION of the system

1.1. NAME: What is the relevant name of the system?

1.2. RESPONSIBILITY: What is the responsible organization?

1.3. SUPPORT: What are the sources of support?

1.4. PARTICIPANTS: Who (and what organizations, if
relevant) are the principal participants?

1.5. START: What was the date of conception?

1.6. COMPLETION: What was, or is expected to be, the
completion date? (Specify prototype acceptance date, or
design completion date if design only.)

1.7. BIBLIOGRAPHY: What are the most relevant references?


2. MOTIVATION for the system

2.1. PURPOSE: What is the main purpose of the system
(e.g., general-purpose computing, real-time air-traffic
control, store-and-forward)?

2.2. PHYSICAL ENVIRONMENT: Where does the system operate
(e.g., ground-based, airborne, spaceborne)?

2.3. COMPUTING ENVIRONMENT: How does the system relate
computationally to its environment (e.g., locally,
remotely, via a network, interactively, via peripherals,
with human users)?

2.4. COMPUTING OBJECTIVES: What are the specific computing
objectives, regarding capability, capacity, performance
(throughput or response), configuration scalability,
maximum real-time delays, etc. (as relevant)?

2.5. RELIABILITY OBJECTIVES: What are the specific system
reliability objectives, with respect to desired
availability during what period, minimum time to system
failure, maximum permitted duration of outage, etc.?

2.6. DYNAMIC VARIABILITY: How may these objectives vary
during operation? (E.g., how may performance degrade?
May performance be exchanged for increased reliability?)

2.7. PENALTIES: What are the penalties arising from
faulty operation? (Possible examples include loss of
life, badly decreased performance, the necessity of manual
intervention, loss of revenue, etc.)

2.8. CONSTRAINTS: What explicit physical constraints exist
(e.g., with respect to size, weight, power, cost)?

2.9. TRADEOFFS: What critical tradeoffs exist among the
objectives?


3. DESCRIPTION of the system

3.1. ARCHITECTURE

3.1.1. CONFIGURATIONS

3.1.1.1. INTERCONNECTIVITY: What is the basic
configuration, and what restrictions exist on interconnectivity?
(You may choose to include a block diagram, a PMS diagram
a la Bell and Newell, or other useful representation.)

3.1.1.2. RANGE: What is the range over which
configurations are sensible (minimum to maximum), e.g., how many
processors, how many memory modules (of what size and word
length, and with what restrictions if any), etc.?

3.1.1.3. CAPABILITY: What is the effective computing power
of the smallest sensible configuration in 3.1.1.2? Please
compare it roughly with a well-known system (e.g., 360/40,
65, 195), and cite a ball-park figure for the number of
additions per second. Capability required for
fault-tolerance should not be included.

3.1.2. EXECUTIVE and operating system

3.1.2.1. MODES of operation: How does the system operate?
(E.g., is each processor multiprogrammable? Is
independent-process multiprocessing possible? Is cooperative-process
multiprogrammed multiprocessing possible?)

3.1.2.2. SOFTWARE organization: What is the structure of
the system software? How is it distributed with respect
to the hardware?


3.2. FAULT TOLERANCE

3.2.1. FAULTS TOLERATED: What faults are tolerated by the
system, with what resulting effects on system behavior?

3.2.2. FAULTS NOT TOLERATED: What faults cannot be
tolerated by the system, and what are the corresponding
effects? Identify the weakest links.

NOTE: Faults may be characterized in many ways, including
type (e.g., faulty hardware at various levels such as a
chip, module, bus, power supply, arithmetic unit,
processor, memory; faulty software such as in the
executive, in a compiler, or in an application program;
faulty usage and bad inputs), nature (e.g., timing
considerations, old age, various physical phenomena),
duration and frequency (e.g., one-shot, recurrent,
permanent), scope (e.g., isolated faults, correlated or
independent multiple faults, with varying degrees of
propagation), effect (random, predictable), etc.

3.2.3. TECHNIQUES: What basic techniques are employed to
provide fault-tolerant capability, and when, where, and
how are they used? Include hardware and software
techniques.

NOTE: Applicable techniques include (possibly in
combination) replication (e.g., triple-modular redundancy
at various levels, redundant computations using
independent algorithms), coding (e.g., error-detecting or
-correcting codes on a bus, in memory, in arithmetic),
repetition and rollback, reconfiguration (including
removal without replacement and replacement with spares),
diagnostics (e.g., stand-alone, on-line, interactive;
preventive, emergency; remote, local), protection (of
processes, data, programs, etc.), and outside intervention
(human or otherwise). These techniques may be used
statically (e.g., always invoked) or dynamically (e.g.,
configured as needed); at various module levels in
hardware and software; in combination with certain events
and with certain other techniques.

3.3. NOVELTY: What are the most unusual design features?

3.4. INFLUENCES: What other efforts (systems, research)
have had an influence on your system design?

3.5. HARD-CORE: If there is a concept of "hard-core" in
your system, what is its significance? (Please define
your concept.)


4. JUSTIFICATION for the system

4.1. RELIABILITY EVALUATION: How is reliability estimated
and/or demonstrated (e.g., via analysis, simulation,
stimulation of faults, theoretical arguments)?

4.2. COMPLETENESS OF EVALUATION: How complete is your
design evaluation?

4.3. OVERHEAD: What percentage(s) of total system
resources do you attribute to the achievement of
fault-tolerance? (Consider cost, logic, execution time,
memory, etc., as applicable.)

4.4. APPLICABILITY: What is the potential scope of
applicability beyond that stated in sections 2.1 - 2.4 above?

4.5. EXTENDABILITY: In what ways could the system design
be advantageously extended, with what increase in cost,
and to what effect?

4.6. CRITICALITIES: How critically do the design choices
match the design goals? (E.g., could slight changes in
goals result in great savings in design, implementation,
and/or operation? Is multiprogramming or multiprocessing
critical? Is the choice of hardware critical?)

4.7. IMPLICATIONS: What special requirements (if any) does
the basic design impose (e.g., on the hardware designers,
on the software developers, on users and maintainers)?


5. CONCLUSIONS

5.1. STATUS: What is the current status of the system?

5.2. EXPERIENCE: What conclusions can you reach based on
your experience with the system to date (e.g., in design,
implementation and operation)?

5.3. FUTURE: What is planned for future development or use
of the system?

5.4. ADVANCES: What developments (theoretical or
practical) would be desirable for significantly advancing
the state of the art in fault-tolerant computing?


6. COMMENTS (Please include any comments on your system,
on this questionnaire, etc. which you would like to add.
Opinions, prejudices and philosophies are welcomed.)


SURVEY  OF  FAULT-TOLERANT  COMPUTING  SYSTEMS 

L.  J.  Koczela,  North  American  Rockwell  Corp. 

3370 Miraloma Avenue, Anaheim, California 92803, May 1972


1.  IDENTIFICATION 

1.1. NAME: A Three Failure Tolerant Computer System

1.2. RESPONSIBILITY: Electronics Group, North American
Rockwell Corp.

1.3. SUPPORT: Manned Spacecraft Center, NASA

1.4. PARTICIPANTS: L. J. Koczela, J. Jurison, 0. Brosius
- North American Rockwell; P. Sollock - NASA.

1.5. START: 1/1/70

1.6. COMPLETION: 1/1/71 (design concept).

1.7. BIBLIOGRAPHY: A Three Failure Tolerant Computer
System, IEEE Trans. on Computers, November 1971


2.  MOTIVATION 

2.1.  PURPOSE:  Real-Time  Central  Guidance  and  Control 
Computer 

2.2.  PHYSICAL  ENVIRONMENT:  Spaceborne 

2.3. COMPUTING ENVIRONMENT: The computer system interacts
with avionics subsystems via a multiplexed data bus.

2.4. COMPUTING OBJECTIVES: 30,000 words of memory;
500,000 operations/second speed

2.5. RELIABILITY OBJECTIVES: Must tolerate first two
failures with no degradation in performance and third
failure with no degradation in safety.

2.6. DYNAMIC VARIABILITY: Third failure could have less
computational capacity.

2.7. PENALTIES: Would require manual intervention with
possible loss of life.

2.8. CONSTRAINTS: No physical constraints but a relative
weighting of importance between physical parameters.

2.9.  TRADEOFFS:  Size,  weight  and  power  least  important. 


3.  DESCRIPTION 

3.1.  ARCHITECTURE 

3.1.1.  CONFIGURATIONS 

3.1.1.1. INTERCONNECTIVITY: Four redundant computers
interconnected by four voter switches at their I/O
channels.

3.1.1.2. RANGE: 2-6 CPUs, no restrictions on word
length.

3.1.1.3. CAPABILITY: 500,000 operations/second

3.1.2. EXECUTIVE

3.1.2.1. MODES: The executive may operate the redundant
computers in many modes of operation: non-redundant
independent computers, multi-programmed, multi-computer,
and various combinations of redundancy such as comparison,
voting, etc.

3.1.2.2. SOFTWARE: Software control is equally distributed
among the redundant computers - no central control exists.

3.2.  FAULT  TOLERANCE 

3.2.1. FAULTS TOLERATED: Any 3 faults. A fault can range
from a single circuit element to a complete module such as
a CPU failing. A failure has no effect on system behavior.
The system actually tolerates more than three faults of
many different types but it will tolerate at least any
three faults.

3.2.2. FAULTS NOT TOLERATED: Software faults that are not
caught in debugging.

3.2.3. TECHNIQUES: The technique used is replication of
hardware with quadruple redundancy. Computations are
performed redundantly and reconfiguration is accomplished
without removal or replacement after failure detection by
voting.


3.3. NOVELTY: Through the redundant use of adaptive
voters operating on the input/output of redundant
computers, any three failures can be tolerated.
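
As an illustration of the adaptive voting described in 3.2.3 and 3.3, the following Python sketch votes among four redundant outputs and stops listening to a unit once it has been outvoted. It is a minimal sketch under stated assumptions, not the voter-switch design itself: the class name AdaptiveVoter, the exclusion policy, and the use of None as a fail-safe indication are all hypothetical.

```python
from collections import Counter


class AdaptiveVoter:
    """Hypothetical adaptive voter: votes only among units still trusted."""

    def __init__(self, n_units=4):
        # Indices of the units that have not yet been outvoted.
        self.active = set(range(n_units))

    def vote(self, outputs):
        """Return the majority output of the active units, or None.

        outputs: one hashable value per unit, indexed by unit number.
        Units that disagree with the majority are dropped from later
        votes. None means no strict majority exists (fail-safe case).
        """
        ballots = {u: outputs[u] for u in self.active}
        if not ballots:
            return None
        value, count = Counter(ballots.values()).most_common(1)[0]
        if count * 2 <= len(ballots):
            return None  # tie or split: report it rather than guess
        self.active = {u for u, v in ballots.items() if v == value}
        return value
```

In this sketch the first two failures are masked by majority. A third failure leaves two active units; if they disagree, the sketch reports the split instead of choosing a value, which loosely mirrors the reliability objectives stated in 2.5.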

3.4.  INFLUENCES:  None 

3.5. HARD-CORE: No hard core exists.


4.  JUSTIFICATION 

4.1. RELIABILITY EVALUATION: Extensive fault simulations
have been successfully performed.

4.2. COMPLETENESS OF EVALUATION: It is impossible to
verify a design goal of 100 percent confidence.

4.3.  OVERHEAD:  For  triple  failure  tolerance,  about  80%, 
less  for  lower  failure  tolerance. 

4.4.  APPLICABILITY:  To  many  critical  real-time  control 
systems,  industrial,  apace  and  defense  applications. 

4.5. EXTENDABILITY: The design can be extended to
tolerate different numbers of failures, e.g. any two
failures, any four failures, etc.

4.6. CRITICALITIES: Requirement for 100% confidence in
tolerating any 3 failures is very critical; lowering to 99
percent or so would reduce complexity and cost.

4.7. IMPLICATIONS: Hardware designers must insure
independence of failures at computer I/O interfaces.

5.  CONCLUSIONS 

5.1. STATUS: System design concept completed,
voter-switch detailed design completed, prototype hardware
of voter-switch currently under development.

5.2.  EXPERIENCE:  A  very  rigid  failure  tolerance 
requirement  can  be  met  assuring  that  a  minimum  number  of 
failures  will  be  tolerated. 

5.3.  FUTURE:  Possible  use  on  space  shuttle  program 

5.4. ADVANCES: A significant area that can enhance the
state of the art in designing fault-tolerant computers is
analysis of failure modes of components and computer
subsystems in depth. Another very important area is
error-free software.


6. COMMENTS: Much of the work on fault-tolerant
computers is dedicated to single failures at the gate and
circuit level. Unfortunately, in many cases this is not
applicable to real world failures when considering
computers mechanized from state of the art LSI integrated
circuits.


VCS mechanization.


SURVEY OF FAULT TOLERANT COMPUTER SYSTEMS

Algirdas Avizienis

UCLA Computer Science Dept., Los Angeles, CA and
Spacecraft Computer Section, JPL, Pasadena, CA, March 1973

1. IDENTIFICATION

1.1 NAME: JPL-STAR (Self-Testing-And-Repairing) Computer

1.2 RESPONSIBILITY: Spacecraft Computer Section,
Astrionics Division of the Jet Propulsion Laboratory,
Pasadena, California.

1.3 SUPPORT: NASA - Office of Advanced Research and
Technology (via JPL)

1.4 PARTICIPANTS: A. Avizienis, D. A. Rennels, J. A.
Rohr, F. P. Mathur, G. C. Gilley

1.5 START: 1961

1.6 COMPLETION: Operational - Spring 1969 (laboratory
model), modifications continue

1.7 BIBLIOGRAPHY:

* A. Avizienis, et al., "The STAR (Self-Testing and
Repairing) Computer: An investigation of the theory and
practice of fault-tolerant computer design," IEEE Trans.
Computers, C-20, pp. 1312-1321, November 1971.

* A. Avizienis, "Design of fault-tolerant computers," FJCC,
pp. 733-743, 1967.

* A. Avizienis, "An experimental self-repairing computer,"
Information Processing, IFIP, Vol. 2, pp. 872-877, 1968.

* A. Avizienis, F. P. Mathur, D. Rennels, and J. A. Rohr,
"Automatic maintenance of aerospace computers and
spacecraft information and control systems," Proc. AIAA
Aerosp. Comput. Syst. Conf., Paper 69-966, pp. 1-11,
September 8-10, 1969.

* A. Avizienis, "Concurrent diagnosis of arithmetic
processors," Digest of the 1st Annual IEEE Comp. Conf., pp.
34-37, 1967.

* A. Avizienis, "Arithmetic error codes: Cost and
effectiveness studies for application in digital system
design," IEEE Trans. Comp., C-20, pp. 1322-1331, Nov. 1971.

* F. P. Mathur and A. Avizienis, "Reliability analysis and
architecture of a hybrid-redundant digital system:
Generalized triple modular redundancy with self-repair,"
SJCC, pp. 375-383, 1970.

* F. P. Mathur, "On reliability modeling and analysis of
ultrareliable fault-tolerant digital systems," IEEE Trans.
Comp., C-20, pp. 1376-1382, November 1971.

* G. C. Gilley, "Automatic maintenance of spacecraft systems
for long-life, deep-space missions," Ph.D. dissertation,
Dept. Comput. Sci., UCLA, September 1970.

* F. P. Mathur, "Reliability estimation procedures and CARE:
The computer aided reliability estimation program," Jet
Propul. Lab. Quart. Tech. Rev., Vol. 1, October 1971.

* A. Avizienis and D. Rennels, "Fault-Tolerance Experiments
with the JPL-STAR Computer," Proc. of the Sixth Annual
International Conference of the IEEE Computer Society
(COMPCON), San Francisco, California, 1972, pp. 321-324.

* A. Avizienis, "Arithmetic Algorithms and Processor Design
for Error-Coded Operands," IEEE Transactions on Computers,
June 1973.

* G. C. Gilley, "A Fault-Tolerant Spacecraft," Digest of the
1972 International Symposium on Fault-Tolerant Computing,
Newton, Mass., June 19-21, 1972, pp. 105-109.

* F. P. Mathur, "Automation of Reliability Evaluation
Procedures through CARE: The Computer-Aided Reliability
Estimation Program," AFIPS Conference Proceedings (Fall
Joint Computer Conference) Vol. 41, Anaheim, California,
December 5-7, 1972.

* J. A. Rohr, "System Software for a Fault-Tolerant Digital
Computer," Ph.D. Thesis, University of Illinois, Department
of Computer Science, Urbana, Illinois, February 1973.


2.  MOTIVATION 

2.1  PURPOSE:  Experimental  laboratory  GP  machine;  suitable 
for  spacecraft  control 

2.2  PHYSICAL  ENVIRONMENT:  Laboratory  environment 

2.3 COMPUTING ENVIRONMENT: Local I/O facilities

2.4 COMPUTING OBJECTIVES: Capable of automatically
maintaining an unmanned spacecraft

2.5 RELIABILITY OBJECTIVES: 100,000 hour survival with
0.95 reliability; tolerance of transient faults; outage for
recovery below 50 msec.

2.6 DYNAMIC VARIABILITY: Maximum computing power required
at end of mission

2.7 PENALTIES: None for lab model; loss of spacecraft for
flight model

2.8 CONSTRAINTS: None for lab model; for the flight model
the weight of the subsystem was not to exceed 40 lb. and
the power consumption was not to be greater than 40 W.

2.9  TRADEOFFS:  None 


3.  DESCRIPTION 

3.1  ARCHITECTURE 

3.1.1  CONFIGURATIONS 

3.1.1.1 INTERCONNECTIVITY: See Figure

3.1.1.2 RANGE: One processor of each class (operating); 16
memory modules of 4096 words each (maximum operating
memory)

3.1.1.3 CAPABILITY: 500 kHz maximum clock rate and
byte-serial operation in laboratory model.

3.1.2  EXECUTIVE 

3.1.2.1 MODES: The entire set of active STAR computer
modules operates as a single, general-purpose computer.

The executive implements a two-partition, interrupt-driven,
multiprogramming environment on the machine. Four modes of
operation under the executive are distinguished. (1) The
self-repair mode has highest priority and is entered
immediately after hardware self-repair. This mode
accomplishes self-repair operations delegated to software
such as memory reconfiguration and program resumption. (2)
The interrupt mode is used to process interrupts. While in
this mode, all lower priority interrupts are inhibited by
software. (3) The problem mode is the normal mode of
execution for applications programs. All active interrupts
are enabled when running in the problem mode. (4) The wait
mode is similar to the problem mode except that only
low-priority, cyclic programs are run. The registers of
wait-mode programs are never saved, and the programs can be
resumed at a standard point.

3.1.2.2 SOFTWARE: The software for the STAR computer can
be categorized into four efforts: the programming system,
the resident executive, the demonstration applications
programs, and the spacecraft applications programs. The
programming system consists of an assembler, loader,
functional simulator, and programming executive. The
programming system has been implemented on the UNIVAC 1108.
It is used to generate programs for the STAR computer.

The resident executive which has been designed for the STAR
computer is called STAREX. The STAREX routines are divided
into ten categories: snapshot, self-repair,
initialization, scheduling, timing, interrupt handling,
library management, facilities management, input-output,
and service. The STAREX self-repair routines augment the
self-repair hardware facilities by reconfiguring the memory
and resuming applications programs after self-repair.

STAREX operates in duplicated memory modules and uses a
single variable to maintain its rollback point. (The
rollback point is the address for program resumption after
self-repair.) STAREX also provides facilities for
applications programs to establish rollback points.
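
The rollback mechanism described above can be sketched in a few lines of Python. This is an illustrative model only, not STAREX code: the names run_with_rollback and TransientFault, the representation of a program as a list of segment functions, and the retry limit are assumptions introduced here.

```python
class TransientFault(Exception):
    """Stands in for a fault signal raised by the checking hardware."""


def run_with_rollback(segments, state, max_retries=3):
    """Run segments in order; on a fault, repeat from the rollback point.

    segments: list of functions that take a state dict and return the
    updated state. Each segment receives a copy of the committed
    state, so a faulted segment cannot corrupt what was saved.
    """
    rollback_point = 0   # single variable: index of the segment to resume at
    retries = 0
    while rollback_point < len(segments):
        try:
            new_state = segments[rollback_point](dict(state))
        except TransientFault:
            retries += 1
            if retries > max_retries:
                raise    # fault persists: no longer treated as transient
            continue     # repeat the same segment from the saved state
        state = new_state     # commit results, then advance the rollback point
        rollback_point += 1
        retries = 0
    return state
```

The sketch makes one design point visible: results are committed only when a segment finishes, so repeating a segment after a fault starts from a consistent state.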

Demonstration application programs have been developed for
demonstrating the STAR computer laboratory breadboard.
These programs successfully survive transient and simulated
permanent faults and properly resume computation after the
fault is removed. These programs establish rollback points
by calling the executive routines.

Spacecraft applications programs have been investigated as
part of the preliminary studies of the TOPS control
computer subsystem which was eventually to be used on board
the Grand Tour spacecraft.


3.2  FAULT  TOLERANCE 

3.2.1 FAULTS TOLERATED: The principal goal of the design
is to attain fault tolerance for a variety of faults:
transient, permanent, random, and catastrophic.


3.2.2 FAULTS NOT TOLERATED: (a) Transients at a rate
higher than allowed by the length of "rollback" segments of
programs; (b) shorted bus wires (isolators are employed) or
power switch "on" failures.


3.2.3 TECHNIQUES: All machine words (data and
instructions) are encoded in error-detecting codes. Fault
detection occurs concurrently with program execution.

The computer is divided into a set of replaceable
functional units containing their own instruction decoders
and sequence generators. This decentralization allows
simple fault-location procedures and simplifies system
interfaces.


* Fault detection, recovery, and replacement are carried
out by special-purpose hardware. Memory reconfiguration and
program resumption are accomplished by the resident
executive.


* Transient faults are identified and their effects are
corrected by the repetition of a segment of the current
program; permanent faults are eliminated by the replacement
of faulty functional units.


* The replacement is implemented by power switching: units
are removed by turning power off and connected by turning
power on. The information lines of all units are
permanently connected to the buses through isolating
circuits; unpowered units produce only logic "zero"
outputs.


* The error-detecting codes are supplemented by monitoring
circuits which serve to verify the proper synchronization
and internal operation of the functional units.


* The "hard core" test and repair processor (TARP) is
protected by triplication and replacement of failed members
of the triplet.
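
The survey response does not say which error-detecting code is used for machine words. As a hedged illustration of how a code can be checked concurrently with arithmetic, the Python sketch below uses a simple residue check. The modulus 15 and the function names are assumptions chosen for the example, not details taken from this response.

```python
CHECK_MODULUS = 15  # assumed for illustration; not stated in the survey response


def encode(value):
    """Attach a check symbol (the residue of the value) to a word."""
    return (value, value % CHECK_MODULUS)


def is_valid(word):
    """A word is a valid code word if its check symbol matches its value."""
    value, check = word
    return value % CHECK_MODULUS == check


def checked_add(a, b):
    """Add two coded words; the check symbols are added independently."""
    total = a[0] + b[0]
    check = (a[1] + b[1]) % CHECK_MODULUS
    return (total, check)
```

Because the check symbol of a sum is computed separately from the sum itself, a fault in either path leaves the result inconsistent. With modulus 15, a single-bit error changes the value by a power of two, which is never a multiple of 15, so the mismatch is always detected in this model.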


3.3 NOVELTY: Power switching, status signals, encoding of
instructions, emphasis on transient-recovery with program
survival.


3.4 INFLUENCES: Theoretical work by Reed and Brimley;
Kruus and Seshu; Griesmer, Miller and Roth.


3.5 HARD-CORE: The "hard core" monitor of the STAR system
is designated as TARP (test and repair processor) in the
Figure. The TARP monitors the operation of the STAR
computer by two methods: (1) testing every word sent over
the two data buses for validity of its code; and (2)
checking the status messages from the functional units for
predicted responses.


Three fully powered copies of the TARP are operated at all
times together with n standby spares (n = 2 in the present
design). The outputs of the TARPs are decided by a
2-out-of-(n+3) threshold vote. When one powered TARP
disagrees with the other two, the recovery mode is entered
and an attempt is made to set the internal state of the
disagreeing unit to match the other two units. If this TARP
rollback attempt fails, the disagreeing unit is returned to
the standby condition and one of the standby units receives
power, goes through the TARP rollback, and joins the
powered triplet. A standard rollback then occurs and the
resident executive resumes normal program operation.
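
The 2-out-of-(n+3) threshold vote can be modeled directly. The Python sketch below is a minimal model under stated assumptions: TARP outputs are treated as single bits, unpowered standby units are taken to contribute 0, and the function names are hypothetical.

```python
def threshold_vote(outputs, threshold=2):
    """Return 1 if at least `threshold` of the binary inputs are 1, else 0."""
    return 1 if sum(outputs) >= threshold else 0


def disagreeing_units(outputs, powered):
    """List the powered units whose output differs from the voted value."""
    voted = threshold_vote(outputs)
    return [u for u in powered if outputs[u] != voted]
```

With n = 2 there are five inputs. If units 0, 1, 2 are powered and the outputs are [1, 0, 1, 0, 0], the vote is 1 and unit 1 is reported as the disagreeing unit, which is the unit that would first be rolled back and, failing that, replaced by a spare.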

Because of the three unit requirement, design effort has
been concentrated on reducing the TARP to the least
possible complexity. Experience with the present model has
led to several refinements of the design.


The replacement of faulty functional units is commanded by
the TARP vote and is implemented by power switching. It
offers several advantages over the switching of information
lines which connect the units to the bus. The number of
switches is reduced to one per unit, power is conserved,
and strong isolation is provided for catastrophic failures.
Magnetic power switches have been developed which are part
of each unit's power supply and are designed to open for
most internal failures. The threshold function is inherent
in the control windings of the switch. The information
lines of each unit are permanently connected to the buses
through component-redundant isolation circuits. The signal
on a bus is the logic OR of all inputs from the units, and
unpowered units produce only logic zero outputs. The power
switch and the buses utilize component redundancy for
protection against fatal "shorting" failures.
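
The bus rule stated above (logic OR of all inputs, with unpowered units contributing zero) explains why a unit can be replaced by power switching alone. A minimal Python model follows; the function name and the representation of a unit as a (powered, word) pair are assumptions for illustration.

```python
def bus_value(units):
    """Signal on the bus: bitwise OR of the outputs of all connected units.

    units: list of (powered, output_word) pairs. An unpowered unit is
    modeled as driving all zeros, whatever its internal state.
    """
    value = 0
    for powered, word in units:
        value |= word if powered else 0
    return value
```

Switching power from a faulty unit to a spare changes which word reaches the bus without touching any information line: bus_value([(False, 0b1111), (True, 0b0110)]) yields the spare's word, 0b0110.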


4.  JUSTIFICATION 

4.1 RELIABILITY EVALUATION: The computing operations for
the analysis were done with the aid of the computer-aided
reliability estimation (CARE) program, which was developed
as a design tool during the reliability study. CARE is a
software package developed on the Univac 1108. CARE may be
interactively accessed by a designer from a teletype
console to calculate his reliability estimates. The input
is in the form of a system configuration description
followed by queries on the various reliability parameters
of interest and their behavior with respect to mission
time, fault coverage, failure rates, dormancy factors,
allocated spares, and partitioning. The CARE program is
extensible, and it may be updated to incorporate new
reliability models as they become available. A second set
of programs, the Reliability Modeling System (RMS), was
developed as a tool in the experimental verification of the
STAR breadboard. This set of programs computes the
reliability of the various subsystem configurations using
"coverage" parameters experimentally obtained by inserting
faults into the system. RMS is an interactive system
implemented in APL.
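
To show how parameters such as mission time, failure rate, coverage, and allocated spares interact, the Python sketch below evaluates one textbook model. It is not the CARE program or any of its equations. It assumes a constant failure rate for the powered unit, spares that cannot fail while unpowered (the limiting case of the dormancy factor), and independent recoveries that each succeed with probability equal to the coverage.

```python
import math


def standby_reliability(failure_rate, mission_time, spares, coverage):
    """Reliability of one active unit backed by cold standby spares.

    Failures of the powered unit form a Poisson process, so the system
    survives if at most `spares` failures occur and every one of them
    is successfully covered:
        R = exp(-x) * sum_{k=0..spares} (coverage * x)^k / k!,
    where x = failure_rate * mission_time.
    """
    x = failure_rate * mission_time
    terms = ((coverage * x) ** k / math.factorial(k)
             for k in range(spares + 1))
    return math.exp(-x) * sum(terms)
```

Under these assumptions, with x = 1 and perfect coverage, two spares raise the reliability from about 0.37 to about 0.92, while with zero coverage the spares contribute nothing.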


4.2 COMPLETENESS OF EVALUATION: Physical fault-injection
experiments are currently in progress and are expected to
be completed in 1973.


4.3 OVERHEAD: Depends on the number of spares. With
[illegible] spare for each module, the system is about
[illegible] (i.e., about 150 percent extra cost for fault
tolerance).


4.4 APPLICABILITY: Various real-time applications that
require very fast recovery.


4.5 EXTENDABILITY: Spare processors could be utilized in
a multiprocessor mode. Additional buses and supervisory
mechanisms would be required.


Figure: STAR computer organization. (Labels recoverable
from the figure: CONTROL BUS, STATUS LINES, SWITCH LINES.)

COP   Control processor, contains the location counter and
      index registers.
LOP   Logic processor (two copies are powered).
MAP   Main arithmetic processor.
ROM   Read-only memory, 16,384 permanently stored words.
RWM   Read-write memory unit (4096 words, two copies
      powered, 12 units directly addressable).
IOP   Input/output processor, contains I/O buffer.
IRP   Interrupt processor, handles interrupt requests.
TARP  Test and repair processor (three copies powered).


4.6 CRITICALITIES: The design goal was a better
understanding of replacement systems. In order to retain
contact with the practice of computer design, it was
decided to design and construct an experimental
general-purpose digital computer which would incorporate
dynamic redundancy (i.e., fault detection and replacement
of failed subsystems) as an integral part of its structure.
The design objectives have been carried out and the system,
called the STAR computer, began operation in 1969. The
modular nature of the STAR computer has allowed systematic
expansion and modifications that are still being continued.

An early objective of the design is to study the class of
problems which are encountered in transforming the
theoretical model of a self-repairing system into a working
computer. State-of-the-art integrated circuit and memory
technology was employed in the design. This objective
appears to have been attained reasonably well.

4.7 IMPLICATIONS: (a) Designers must give advance
attention to modularization and coded operands; (b) special
software features are needed (see 3.1.2.2); (c) users must
observe "rollback" rules in programming.

5.  CONCLUSIONS 

5.1 STATUS: Operating in laboratory; being extensively
tested and modified to improve weaknesses that are


5.2 EXPERIENCE: Practical implementation of replacement
systems is feasible. Transient faults can be
systematically eliminated without program loss. Transient
tolerance can be specified in terms of "duration" and
"frequency" parameters.

5.3 FUTURE: The research and development program which
led to the STAR computer is continuing in several
directions. Analysis of automatic maintenance algorithms
and design of a command/data bus for their implementation
are under intensive study. Other current investigations
are concerned with the following areas: (1)
hardware-software interaction in a fault-tolerant system
with recovery, especially the interaction between the TARP
and the resident executive; (2) tuning of the resident
executive to optimize performance with regard to rollback,
both in the executive and applications programs; (3)
studies of advanced recovery techniques, i.e.,
post-catastrophic restart, TARP replacement schemes,
recovery from massive interference, partial utilization of
failed units; (4) advanced component technology, especially
methods to attain bus and power switch (i.e., hard core)
immunity to faults; (5) heuristic studies of fault
tolerance by interpretation of extensive experiments with
the STAR breadboard as the instrument; (6) design of a
second-generation STAR-type computer with universal
processor and storage modules, and their implementation by
large-scale integration; (7) computational utilization of
the spare units for supplemental tasks in a multiprocessing
mode.


6. COMMENTS: Design, construction, and testing of
laboratory models is critically important to advance the
state of the art and to gain acceptance among practitioners
of design in industry.

The STAR computer breadboard consists of three Read-Write
memory units, one Read-Only memory unit, one copy of each
of the processing modules, and one TARP (Test and Repair
Processor). The breadboard provides adequate facilities for

* experimental verification of the fault detection,
diagnosis, and recovery algorithms employed in this
construction, and for

* the development of fault-tolerant software techniques.

The development of the breadboard resulted in a direct
confrontation with the technological problem areas in
fault-tolerant computing, i.e. busing, isolation, power
switching, etc. This resulted in a better understanding of
these problems and a set of innovative solutions.


SURVEY OF FAULT-TOLERANT COMPUTING SYSTEMS


Jacques J. Delamare, Electronique Marcel Dassault
(E.M.D.), 55, quai Carnot, 92 - Saint-Cloud France, June
1972


1.  IDENTIFICATION 

1.1. NAME: MECRA (Maquette Experimentale de Calculateur
a Reconfiguration Automatique).

1.2. RESPONSIBILITY: E.M.D. (Electronique Marcel
Dassault).

1.3. SUPPORT: Support has three sources: D.G.R.S.T.
(Delegation Generale a la Recherche Scientifique) with
preliminary studies; D.R.M.E. (Direction des Recherches et
des Moyens d'Essais) with realization of MECRA project;
E.M.D. (Electronique Marcel Dassault) in each case.

1.4. PARTICIPANTS: Jacques J. Delamare, Gerard Germain,
Jean-Claude R. Charpentier, all of E.M.D., and four
researchers from "Centre de Calcul Numerique de Toulouse".

1.5. START: May 1970

1.6. COMPLETION: July 1972; this consists of a
demonstration of fault tolerance and reconfiguration
capabilities. Evaluation of reliability performance is
expected to be in Autumn 1972.

1.7. BIBLIOGRAPHY: "The MECRA: a Self Reconfigurable
Computer for Highly Reliable Process", IEEE vol C-20 no.
11, pp. 1382-1388, Nov. 1971. A report also due end of
1972.


2.  MOTIVATION 

2.1. PURPOSE: The system was conceived for research in
fault-tolerant computer architecture, feasibility, and
reliability evaluation. The idea for further development
is a real-time medium-sized computer for aircraft.

2.2. PHYSICAL ENVIRONMENT: System operates in EMD
laboratories.

2.3. COMPUTING ENVIRONMENT: A single peripheral allows
communication with MECRA.

2.4. COMPUTING OBJECTIVES: Main objectives of the
project were not computing objectives. However, addition
and multiplication are performed with 11 decimal digits
plus sign operands. Complete addition needs less than 300
microsec. Such delays relate to the cycle time of
microprogram memory (1 microsec), to response time of
discrete circuits, to unused time intervals in each
microinstruction cycle (allowing hardware modifications),
and lastly to the microsoftware package (allowing
reconfiguration).

2.5. RELIABILITY OBJECTIVES: Practical experience and a
concrete basis for evaluation such as:

reliability gain with different kinds of redundancy,
hardcore contribution in failure probabilities,
hardcore contribution with different architectures,
reliability gain with reconfiguration,
cost increase in control with reconfigurability,
lost time due to reconfiguration (during and after),
hardcore response time with respect to computing time.

These reliability objectives were only of interest for
high probabilities of success (probabilities higher than
.9).

2.6. DYNAMIC VARIABILITY: Computing speed but not
accuracy may degrade with reconfiguration (20% maximum).
Performance cannot be exchanged for increased reliability,
for example by switching two processors, each one having
its own job, to parallel processing on the same job and
checking one another.

2.7. PENALTIES: Penalties from faulty operation can be
of several kinds: /Loss of time due to recovery processes,
lessened performance after self-reconfiguration, loss of
service./ Manual interventions have not been
investigated, but will be necessarily improved as a
consequence of self-testing and self-healing capabilities
of MECRA.

/SRI note: The text enclosed in slashes is an SRI
paraphrase of the original survey response./

2.8. CONSTRAINTS: Circuitry size might not exceed four
times the size of the equivalent irredundant computer.


3. DESCRIPTION

3.1. ARCHITECTURE

3.1.1.  CONFIGURATIONS 

3.1.1.1. INTERCONNECTIVITY: See IEEE paper. The basic
configuration is a microprogrammed monoprocessor with a
bus architecture. A restriction can be seen here since
addresses are binary coded, whereas data are Decimal
Hamming coded. This has no importance for the purpose of
the project, but would not have been used on a prototype.

3.1.1.2. RANGE: Control Unit Configuration:

                              Maximum   Minimum

counters                         4         1
spare counters                   3         0
registers                        8         6
spare registers                  4         0
multiplication processors        3         1
addition processors              2         1
'and' logic processors           4       3 or 2
'or' logic processors            4       3 or 2
'exclusive or' processors        4       3 or 2
'inverter' blocks                4       3 or 2

Note: Any logic function can fail completely and can be
reconfigured with three other functions. In several cases
a failed logic function can be reconfigured with only two
other functions.

Memory configuration: Three memory blocks - 4 K 16-bit
words. Each memory block has its own address decoder
circuits. At each memory cycle a 48-bit word is read or
written; this word contains two identical words of 24 bits
each, so that any one of the three blocks can be declared
void and the computer still runs if the other two operate
properly. Efficiency of address error detection reaches
50% on each memory block. After any read restore cycle,
each eight-bit byte (6 bytes) is checked and is switched
or not on busses. Then error detection efficiency is 50%
with instructions or microinstructions (if there is only
one erroneous bit) and 100% with data (if there are one or
two erroneous bits).
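
The response does not give the bit layout that lets two 24-bit copies survive the loss of any one of three 16-bit blocks. The Python sketch below shows one layout that is consistent with the description; it is purely an assumption made for illustration, as are the function names store and load.

```python
MASK8, MASK16, MASK24 = 0xFF, 0xFFFF, 0xFFFFFF


def store(word24):
    """Spread two copies (A and B) of a 24-bit word over three 16-bit blocks."""
    block0 = (word24 >> 8) & MASK16              # copy A, bits 23..8
    block1 = ((word24 & MASK8) << 8) | ((word24 >> 16) & MASK8)
    # block1 high byte: copy A bits 7..0; low byte: copy B bits 23..16
    block2 = word24 & MASK16                     # copy B, bits 15..0
    return [block0, block1, block2]


def load(blocks, void=None):
    """Rebuild the word while ignoring the block whose index is `void`."""
    b0, b1, b2 = blocks
    if void == 0:                                # copy B is intact
        return ((b1 & MASK8) << 16) | b2
    if void == 2:                                # copy A is intact
        return (b0 << 8) | (b1 >> 8)
    # void is 1 or None: block 0 supplies bits 23..8, block 2 bits 15..0
    return ((b0 << 8) | b2) & MASK24
```

In this assumed layout the middle block holds the low byte of one copy and the high byte of the other, so losing it still leaves every bit of the word available from the two outer blocks.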


3.1.2.  EXECUTIVE 

3.1.2.1. MODES: MECRA is a monoprocessor.

3.1.2.2. SOFTWARE: There are three working modes on the
computer: user mode, test-diagnosis mode, decision and
reorganization mode.

a) In the USER mode the computer executes the user
program.

b) The TEST-DIAGNOSIS mode is set in motion in two
different ways to which two different programs correspond.
The first is set in motion by interrupts when a failure
has been detected by hardware checkers. The goal of this
program is to localize precisely where the failure
occurred. The second program is set in motion periodically
and its purpose is to test the computer with the data
configurations which reveal failures best. This program
allows detection of the errors which cannot be detected by
the hardware checkers (i.e. an erroneous data with correct
encoding). These two programs update a status table which
contains the status of computer components (failed or not,
number of transient failures). They also decide to stop
the computer when certain catastrophic failures occur or
to set in motion the decision and reorganization mode.

c) In the DECISION AND REORGANIZATION mode, a program
analyzes the status word (in the status table) of the
component in which one of the two test-diagnosis programs
has detected a permanent failure and it decides either to
reconfigure or to stop the computer.
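
The interplay between the status table and the decision and reorganization mode can be sketched as follows. This is a toy model, not the MECRA microsoftware: the table layout, the function name record_failure, the rule that repeated transients are treated as a permanent failure, and the limit of three are all assumptions introduced for the example.

```python
TRANSIENT_LIMIT = 3  # assumed threshold; the survey response gives no number


def record_failure(status_table, component, spares):
    """Update the status table after a detected failure and pick an action.

    Returns "retry" for a presumed transient, "reconfigure" when a
    permanent failure can be covered by a spare, and "stop" otherwise.
    """
    entry = status_table.setdefault(
        component, {"failed": False, "transients": 0})
    entry["transients"] += 1
    if entry["transients"] < TRANSIENT_LIMIT:
        return "retry"
    entry["failed"] = True            # repeated failures: treat as permanent
    if spares.get(component, 0) > 0:
        spares[component] -= 1        # consume one spare for the reconfiguration
        return "reconfigure"
    return "stop"
```

The sketch keeps diagnosis (updating the table) and decision (choosing an action) in one function for brevity; in the description above they are separate programs.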

3.2.  FAULT  TOLERANCE 

3.2.1. FAULTS TOLERATED: Any single fault is tolerated
in memories, arithmetic and logic units (since they are
mounted in a duplex scheme) or in logic units (quadded
redundancy). Any error detected on the busses switches
the MECRA to interrupt programs, while all writing in
memories, registers or counters is inhibited. Multiple
errors can also be tolerated in a number of cases. Multiple
errors can lead to repair or to loss of service as said
above (2.7.).

3.2.2. FAULTS NOT TOLERATED: Faults not tolerated
include errors in the main control circuit, which leads to
a design with an increased degree of microprogramming and
minimized control circuits. Also not tolerated are
errors undetected at the memory output. Power supply
failures have not been investigated in MECRA.


3.2.3. TECHNIQUES: One of the goals of MECRA is an
investigation of as many fault-tolerance techniques as
possible, such as triple modular redundancy, quadded
redundancy, duplex redundancy at very low level (clock)
and higher level (memories and arithmetic circuits),
random redundancy (counters, registers), error detecting
codes (Hamming d = 3) and parity bit, repetition,
rollback, reconfiguration with removal without
replacement, reconfiguration with replacement, diagnosis
(stand-alone, preventive and emergency), local protections
of process and data. These techniques are used statically.

It does not seem possible to describe these techniques in
detail in this paper, since it would require a description
of the whole computer. Other techniques were also
investigated but not used on MECRA, such as stopping the
computer during noisy periods, and control of correct
microprogram linking.

3.3. NOVELTY: When the project started, two ideas
unusual in the literature were employed in MECRA: address
decoder redundancy in memories so as to separate address
errors and data errors, and a single-error-free hard-core.

3.4.  INFLUENCES:  A  synthesis  of  efforts  which  came 
almost  exclusively  from  the  U.S.A.  -  universities, 
laboratories,  and  research  institutes. 

3.5. HARD-CORE: This is defined as a circuit
interconnecting several redundant functions, whatever its
own redundancy level (it is a relative concept).


4.  JUSTIFICATIONS 

4.1. RELIABILITY EVALUATION: Reliability is not
demonstrated; it is computed in two steps using a model.
The first step concerns analysis and drawing a network
model, the second step concerns random failure assignment
into the model. After a great number of trials, the
program furnishes results (e.g. curves, marginal
probabilities...).
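
The two-step procedure described in 4.1 (draw a network model,
then assign random failures over many trials) can be illustrated
with a minimal sketch. It is not the MECRA evaluation program:
the duplex-pair model, the function names and the failure
probabilities below are all assumptions chosen for illustration.

```python
import random


def system_survives(failed, duplex_pairs):
    """Assumed model: the system works if no duplex pair has lost both units."""
    return all(not (a in failed and b in failed) for a, b in duplex_pairs)


def estimate_reliability(unit_fail_prob, duplex_pairs, trials=20000, seed=1):
    """Monte Carlo estimate of the probability that the system survives.

    unit_fail_prob maps each unit name to its (hypothetical) probability
    of having failed by the end of the mission.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    units = sorted({u for pair in duplex_pairs for u in pair})
    survived = 0
    for _ in range(trials):
        # Step 2 of the text: random failure assignment into the model.
        failed = {u for u in units if rng.random() < unit_fail_prob[u]}
        if system_survives(failed, duplex_pairs):
            survived += 1
    return survived / trials


# Hypothetical network: duplexed memories and duplexed arithmetic units.
pairs = [("mem_a", "mem_b"), ("alu_a", "alu_b")]
probs = {"mem_a": 0.1, "mem_b": 0.1, "alu_a": 0.1, "alu_b": 0.1}
print(estimate_reliability(probs, pairs))
```

For this toy model the exact answer is (1 - 0.1 x 0.1) squared, or
0.9801, so the estimate can be checked against it; a real network
model would not generally have such a closed form.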

4.2. COMPLETENESS OF EVALUATION: Program evaluation is
now being tested.

4.3. OVERHEAD: Approximately 60% to 70% of total system
resources are devoted to fault tolerance (same percentage
for logic, cost, and time).

4.6. CRITICALITIES: Use of decimal coded characters
seems not well-fitted to fault-tolerant computers. This
change could result in great savings in design. Other
points are not critical.

4.7. IMPLICATIONS: The basic design assumes low-level
integrated circuits, with a very small number of different
circuits.


5.  CONCLUSIONS 

5.1. STATUS: The system is now operating and will be
delivered in July 72; evaluation will follow during
October and November.

5.2. EXPERIENCE: Everything is possible, except, perhaps,
a sufficiently low cost, and reliable packaging and wiring
of components. Note that LSI would pose problems for
fault-tolerant computers because they need more pins to
check redundant functions before connecting all together.
This would probably lead to simultaneous use of LSI, MSI
and small scale integrated circuits. Component
manufacturers have not yet taken into account
fault-tolerance constraints, but they will probably do so
soon.

5.3. FUTURE: First prototype is projected 1976 - 1977.
Current computer is projected 1980. Use: Missiles,
aircraft, real-time monoprocessors.

5.4. ADVANCES: Different fault tolerant computers can be
roughly compared in terms of reliability versus mission
time; but this will fall back to evaluations of components
and wiring MTBF. Such data, estimated by constructors, do
not seem to give a sufficient common basis for
evaluations. Theoretical and conventional data on
component MTBF seem to be needed for accurate comparisons
among different fault-tolerant computers.
SURVEY OF FAULT-TOLERANT COMPUTING SYSTEMS

Barry R. Borgerson, Computer Systems Research Project

University of California, Berkeley, May 1972.


1.  IDENTIFICATION 

1.1. NAME: PRIME

1.2. RESPONSIBILITY: Computer Systems Research Project
(CSRP), U. C. Berkeley

1.3. SUPPORT: ARPA - Contract No. DAHC70 15 C 0724

1.4. PARTICIPANTS: Herbert B. Baskin, Principal
Investigator; Roger Roberts, Principal Programmer; Barry
R. Borgerson, Head, Hardware R & D.

1.5.  START:  7/1/70 

1.6.  COMPLETION:  First  prototype  to  be  running  about  9/73 

1.7.  BIBLIOGRAPHY: 

* H. B. Baskin, B. R. Borgerson and R. Roberts,
"PRIME - An Architecture for Terminal Oriented Systems,"
Proceedings of the 1972 SJCC, AFIPS Press, pp. 431-437.

* B. R. Borgerson, "A Fail-Softly System for Time
Sharing Use," Digest of the 1972 International Fault
Tolerant Computing Symposium.

* J. T. Quatse, P. Gaulene and D. Dodge, "The
External Access Network of a Modular Computer System,"
Proceedings of the 1972 SJCC, AFIPS Press, pp. 783-790.

* R. S. Fabry, "Dynamic Verification of Operating System
Decisions," CSRP Document No. P-14.1, Univ. California,
Berkeley, 11/72 (To be publ. CACM).

* B. R. Borgerson, "Spontaneous Reconfiguration in a
Fail-Softly Computer Utility," Digest of DATAFAIR 73,
Nottingham, England, April 1973.

* B. R. Borgerson, "Dynamic Confirmation of System
Integrity," FJCC 1972, pp. 89-96.

2.  MOTIVATION 

2.1. PURPOSE: General-purpose, interactive, multi-access
computing.

2.2. PHYSICAL ENVIRONMENT: Ground based

2.3. COMPUTING ENVIRONMENT: Remote access over telephone
lines and eventually over the Arpanet.

2.4. COMPUTING OBJECTIVES: This is not the primary
motivating area in our system design. We anticipate that
the original configuration of PRIME will support about 100
users with a worst case response time of less than two
seconds for trivial jobs.

2.5. RELIABILITY OBJECTIVES: Because we will be able to
repair units as they become faulty, we are aiming for
continuous availability. The system performance should
never degrade below 75% of its peak capacity.

2.6. DYNAMIC VARIABILITY: Performance cannot be
dynamically traded for reliability. However, provisions
may someday be added which will allow dynamically trading
performance for intraprocess integrity (See Section 6).

2.7. PENALTIES: The effects of intraprocess data
contamination (See Section 3.3.2) due to system failures
will strongly depend on the nature and purpose of the
process. There seems to be no way to generalize about
this. If the system itself were to crash, this would no
doubt lead to a loss of revenue if PRIME were transferred
to a commercial environment.

2.8. CONSTRAINTS: There are no specific constraints of
size, weight, and power. The self-imposed constraint on
cost is to try to build a fault-tolerant system that is as
close in cost as possible to any current system with
comparable power and capabilities.

2.9. TRADEOFFS: (Too complicated to deal with briefly;
see Sections 4.4, 4.6 and 6.)


3.  DESCRIPTION 

3.1.  ARCHITECTURE 

3.1.1.  CONFIGURATION 

3.1.1.1. INTERCONNECTIVITY: Figure 1 is a block diagram of
PRIME. The Interconnection Network (IN) allows any
processor to connect to any disk drive, external device, or
other processor. Each processor has three such independent
paths into the IN. The IN connectivity remains universal
over the different system sizes. Universal switching
between all processors and all memory blocks is not
provided. Instead, each processor always connects to exactly
64K of memory regardless of the size of the system.

3.1.1.2. RANGE: The PRIME architecture will usefully
accommodate from 3 to about 30 processors. Each processor
could connect to from 16K to 128K of primary memory.
Depending on the type of disk drives used, from 1 to 5
drives per processor would be reasonable. The current
system has been designed to operate with from three to
eight processors without requiring any additional hardware
or software design. Useful memory sizes range from 64K to
about 256K. Disk drives range from about six to 24. Each
processor to be used in the initial implementation of
PRIME will be a Meta 4 (Digital Scientific Corp.). The
Meta 4 is a general-purpose, 16-bit, 32-register, 90 ns
cycle time microprocessor. The memory is 33 bits wide,
about 600 ns cycle, and made from 1024-bit MOS chips. The
disk drives are double (track) density 2314-type drives
that have been modified to transfer information on two
heads at a time. The initial configuration will have five
processors, 104K of memory, and 15 disk drives.

3.1.1.3. CAPABILITY: The capability is not accurately
known at this time.

3.1.2. EXECUTIVE

3.1.2.1. MODES: At any given time, one processor is
designated the Control Processor (CP) while the rest
function as Problem Processors (PPs). User processes are
run on the PPs. Multiprogramming is not used, but
processes are overlap-swapped. In order to achieve a very
high interprocess integrity, it was decided never to let
two processes share memory; hence, cooperative-process
multiprocessing is not possible with PRIME.

3.1.2.2. SOFTWARE: The system software is divided into
three sections. There is the Central Control Monitor
(CCM) which runs on the Target Machine of the CP; the
Extension of the Control Monitor (ECM) which resides
directly in the microcode of each processor; and the Local
Monitor (LM) which runs on the Target Machine in the PPs.
The CCM is responsible for scheduling processes, allocating
resources, and consummating interprocess message
transfers. The ECM includes the disk, terminal, and
communication controllers, logic for double-checking
critical CCM decisions, bootstrap logic, and some
intelligence to deal with reconfiguration. The LM contains the
file and working-set management systems. The CCM does not
get involved with a process after it has started the
process up. The procedure followed by the CCM is to
allocate the necessary resources, initiate the roll in,
and let the LM and ECM take over from there. The CCM will
not get involved again until the process either times out
or blocks itself. The LM deals only with user processes;
it is completely isolated from the rest of the system.
Because of this, users will be free to provide their own
LM if they do not like the standard one provided.

3.2.  FAULT  TOLERANCE 

3.2.1. FAULTS TOLERATED: PRIME will tolerate all
internal faults. That is, the system is expected to
continue operating even in the presence of any arbitrary
software or hardware faults. The system will reconfigure
to run without any piece of hardware that becomes faulty,
and mechanisms exist for limiting the effects of any
software fault. PRIME has been designed to provide
continuous service to (almost) all terminals. In most
cases, a faulty unit will be repaired and returned to
service before another failure occurs. However, the
system will still continue to operate with a substantial
part of the resources removed from active use. The system
should almost never degrade to below 75 percent of its
maximum capacity. In addition to continuity of some
minimum service, interprocess integrity violations are
prevented at all times; this includes the relatively
unstable periods between the onset of a fault and the
detection and isolation of the faulty unit.

3.2.2. FAULTS NOT TOLERATED: Only environmental faults
are not tolerated by PRIME. The most common of these
faults would be in the A.C. power and air conditioning.
Since it is easy to see how to back these resources up, no
effort has been made to incorporate fault tolerance with
respect to these units within PRIME. While PRIME as a
system will continue to run in spite of internal failures,
individual processes may occasionally get clobbered. That
is, no special provisions have been made in PRIME to
guarantee intraprocess integrity. Hence transient
failures will frequently cause contamination of
information for some process. Also hard failures will
often clobber one process before being detected. The most
serious disruption will probably occur when a disk drive
fails. When this happens, all of the processors that were
using that drive will be suspended until an operator can
recover their data, either by moving the disk pack to
another drive, or recovering from tapes in the unlikely
event of a head crash. But even in this worst-case
catastrophe, only a small part of the users (about 7
percent in the initial system) will be affected.


3.2.3. TECHNIQUES: The basic system-wide technique used
to achieve fault tolerance is to allow the system to
degrade gracefully by reconfiguring to run without any
faulty units. At the heart of the scheme is a distributed
architecture with a multiplicity of all functional units
except the IN, which is designed to fail softly on its
own. Fault detection is accomplished by a variety of
methods that include parity on memory and buses,
surveillance tests on each processor after each job step, a double
check on all critical system-wide decisions made by the
CP, and fault injection in such areas as error detectors
and the seldom used reconfiguration logic. After a fault
is detected, an initial reconfiguration causes a processor
not involved in the detection to become the new CP. This
virtual "hard-core" then initiates diagnostics to locate
the faulty unit, isolate it, and reconfigure the system
to run as efficiently as possible without it. A small
amount of dedicated hardware associated with each
processor guarantees that the initial reconfiguration will
be accomplished properly. It is possible to logically
isolate each major unit at its system boundaries so that
the system can run fine-mesh diagnostics or exercise the
hardware to aid in locating the faulty component. In the
case of a failure of the isolation logic, any unit can be
dynamically powered down to provide guaranteed isolation
from the rest of the system.
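
The reconfiguration sequence described in 3.2.3 can be sketched
as follows. This is only an illustration of the idea of moving
the Control Processor role to a processor that took no part in
the detection; the function names, the list-of-names model of
the system and the diagnose callback are hypothetical and do not
describe PRIME's actual microcode or dedicated hardware.

```python
def choose_new_cp(active, involved):
    """Return the first active processor that took no part in the detection."""
    for proc in active:
        if proc not in involved:
            return proc
    raise RuntimeError("no uninvolved processor is available")


def reconfigure(active, involved, diagnose):
    """Sketch of the two-stage reconfiguration.

    active   -- list of processor names currently in service
    involved -- set of processors involved in detecting the fault
    diagnose -- hypothetical callback run by the new CP; returns the
                name of the faulty unit, or None if no fault is found
    """
    # Stage 1: initial reconfiguration establishes a new CP.
    new_cp = choose_new_cp(active, involved)
    # Stage 2: the new CP acts as temporary "hard-core" and runs diagnostics.
    faulty = diagnose(new_cp)
    # Isolate the faulty unit (if any) and continue with what remains.
    remaining = [p for p in active if p != faulty]
    return new_cp, faulty, remaining


processors = ["P0", "P1", "P2", "P3", "P4"]
cp, bad, left = reconfigure(processors, {"P0", "P1"}, lambda cp: "P1")
print(cp, bad, left)
```

In this toy run P0 and P1 were involved in the detection, so P2
takes over as CP, diagnoses P1 as faulty, and the system goes on
with four processors.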


3.3. NOVELTY: The distributed nature of the system,
including the distributed intelligence in the form of the
ECMs, provides a very powerful structure whereby fault
tolerance is achieved without the use of any "reliable"
hardware. Very high-performance low-cost disk drives have
been incorporated in such a way as to allow these devices
to be used as second level storage, third level storage,
and the swapping medium. By distributing these three
functions over many identical physical units, very high
availability is achieved at what is actually a lower cost
and with higher overall performance than would be possible
with three distinct types of units. PRIME automatically
responds to faults by reconfiguring to run without the
faulty unit. Since there is a multiplicity of all
functional units except the IN, it is quite easy to run
without any particular unit. Rather than make the IN
"reliable," a more economical approach was taken whereby
carefully controlled failure modes were designed into it.
This results in a failure within the IN manifesting
itself as a failure of a small number of ports, which is
equivalent to losing whatever is attached to those ports,
and the system was already designed to handle that
eventuality. The reconfiguration structure is also very
interesting. Whenever a failure is detected, an initial
reconfiguration takes place which establishes a new
processor as the CP. The new CP, which is one not
involved in the detection of the fault, is then used as
the temporary "hard-core" to initiate diagnostics, locate
the fault if indeed one exists, and remove the faulty unit
from the system. The distributed intelligence of PRIME
has been used to provide double checking on all critical
system functions, which in turn guarantees that there will
be no interprocess interference. Probably the most
unusual general feature of PRIME with respect to fault
tolerance is that it is self-diagnosing and self-repairing
without incorporating any "hard-core."


3.4. INFLUENCES: Many previous efforts have, of course,
influenced us, but no single system stands out as having
special influence.


3.5. HARD-CORE: No, there is no "hard-core" in PRIME.
Instead, the concept of a "floating hard-core" exists
whereby a working processor is pressed into service as the
Control Processor whenever a malfunction is detected.
This is consistent with the overall system philosophy of
not having any "reliable" hardware anywhere in the system.


4.  JUSTIFICATION 

4.1. RELIABILITY EVALUATION: Reliability will be
demonstrated by stimulation of faults.


4.3. OVERHEAD: The cost of the additional hardware that
has been incorporated in PRIME specifically for fault
tolerance is less than 10 percent of the total hardware
cost of the system. Less than 10% of each processor's
useful time is devoted to fault-tolerant functions, since
the surveillance programs are run during what would
otherwise be idle time while processes are being swapped.


4.4. APPLICABILITY: PRIME has been very carefully
designed to perform economically in a particular
environment. If it was to be used in another environment, a
detailed analysis would have to be performed to determine
what changes would have to be made to allow it to perform
adequately in the new environment. In particular, many
other potential environments would require that steps be
taken to guarantee intraprocess integrity.


4.6. CRITICALITIES: The choice of disk drives is quite
critical since a low cost/bit is necessary as well as a
high bandwidth due to the different functions these drives
perform. Since 3330-type drives were not available when
this design started, 2314-type drives were selected and
modified to transfer at 5 MHz. Also, the IN had to be
carefully designed with well-specified failure modes.
However, the primary memory and the processors are simply
"off the shelf" items. As for goals, the decision to not
provide intraprocess integrity checks has been carefully
exploited in the design of PRIME and has provided a very
substantial cost savings.


4.7. IMPLICATIONS: Heavy reliance is placed on periodic
checking of hardware rather than concurrent checking.
Thus, the ability to inject faults into the appropriate
areas has been a difficult requirement placed on all of
the hardware designers. The most notable software
requirement imposed by the basic design is the clear
division of the operating system into three parts, one of
which can be furnished by a user. The only significant
requirement placed on a user is that he must be aware that
no intraprocess integrity checks are made (just like in
all current time-sharing systems).

5. CONCLUSIONS

5.1. STATUS: The design of PRIME is about 95 percent
completed, and implementation has begun on both the
hardware and software. The first version capable of
reconfiguring in the presence of a failure should be
running by September, 1973.

5.2. EXPERIENCE: The main conclusion that the responder
can make regarding the design of PRIME is that by somewhat
limiting the goal of the PRIME system, it was possible to
create a system that should exhibit excellent
fault-tolerant characteristics at a much lower incremental cost
than that of any other fault-tolerant system known to him.

5.3. FUTURE: The future will be devoted to building
PRIME. After that, evaluation and tuning will take place
with connection to the Arpanet very likely.

5.4. ADVANCES: It seems that the most significant
development that would aid the PRIME system would be the
availability of a general-purpose, self-checking
processor. Since 100 percent self-checkability is
extremely difficult to design into a processor, the best
course of action here seems to be to wait for LSI
processors of sufficient power to be built. These
processors should be so inexpensive, compared to the rest
of the hardware cost, that running two of them
simultaneously and comparing outputs should be a very
attractive procedure economically. In fact, the current
processing element in PRIME could be broken into several
subprocessors: one for communications, one for the disk
controller, one for the terminal controller, two for the
Target Machine, etc. Probably only the Target Machine
processor would have to be duplexed because the others can
have independent checks on the validity of their results.
With this procedure, intraprocess integrity would be
possible at an insignificant incremental cost. For the
current version of PRIME, the availability of general
procedures for automatically generating test programs
would be extremely valuable.


6. COMMENTS: I have experienced a great deal of difficulty
locating any other efforts at designing and building what I
consider to be truly gracefully degrading self-repairing
systems. Most of the effort in fault-tolerant computing to
date seems to be centered around military systems, or even
more so, around space exploration systems. This typically
dictates that a fixed amount of computing power be made
available at all times; hence, the lack of action around
fail-softly systems. Of course, by providing fault
tolerance through graceful degradation, very substantial
cost savings can be realized over the "redundant" methods.
In addition to allowing the system's performance to degrade
in the presence of faults, we have chosen not to guarantee
intraprocess integrity. Also, PRIME uses no "hard-core" to
initiate diagnosis or reconfiguration. The combination of
these three techniques has allowed us to design a very
economical fault-tolerant time-sharing system. There is
little doubt that the anticipated degradations will be
quite acceptable for a wide range of applications. The lack
of intraprocess-integrity guarantees, however, will be a
limiting factor in expanding this architecture into other
areas. Of course, hardware provisions could be added to
guarantee intraprocess integrity, and the resultant system
would still be more economical than most other
fault-tolerant systems. A more promising approach, and one
which we will undoubtedly explore in the reasonably near
future, is to leave the hardware as is and run critical
programs twice on two different processors. This will
allow the system cost to remain very low, and will also
allow intraprocess integrity guarantees. Thus, only those
processes that need this guarantee will have to pay
for this added feature. A final aspect of the PRIME
architecture that should be investigated is whether it can
more economically provide a guaranteed computing power in
some environments than can be provided by a "redundant"
system. It can be overbuilt by an amount sufficient to
guarantee that its degraded condition is powerful enough to
handle the necessary computing, with background power
available most of the time.
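
The approach mentioned in these comments, running a critical
program twice on two different processors and accepting the
result only if both runs agree, can be sketched as below. The
sketch models a processor as a plain Python callable; that
modelling choice, the IntegrityError name and the example
processors are assumptions for illustration, not part of the
PRIME design.

```python
class IntegrityError(Exception):
    """Raised when the two runs of a critical program disagree."""


def run_twice_and_compare(program, data, processor_a, processor_b):
    """Run `program` on two different processors and compare the results.

    Each processor is modelled as a callable taking (program, data).
    A mismatch means at least one run was contaminated, so no result
    is returned.
    """
    first = processor_a(program, data)
    second = processor_b(program, data)
    if first != second:
        raise IntegrityError("results differ: %r != %r" % (first, second))
    return first


def healthy(program, data):
    """Hypothetical fault-free processor."""
    return program(data)


def flaky(program, data):
    """Hypothetical processor with a fault that corrupts its result."""
    return program(data) + 1


print(run_twice_and_compare(sum, [1, 2, 3], healthy, healthy))
```

Only processes that ask for this check pay for the second run,
which matches the cost argument made in the comments above.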


SURVEY OF FAULT-TOLERANT COMPUTING SYSTEMS

Capt Larry A. Fry, Space and Missile Systems Organization
(SAMSO), Los Angeles AFS, CA, February 1973.

1.  IDENTIFICATION 

1.1. NAME: Modular Spacecraft Computer

1.2. RESPONSIBILITY: SAMSO/PYT, Los Angeles AFS, CA.

1.3. SUPPORT: Not available

1.4. PARTICIPANTS: Raytheon, Sudbury, MA; Ultrasystems,
Inc., Newport Beach, CA; Logicon, San Pedro, CA.

1.5. START: Project started mid-1971

1.6. COMPLETION: Logicon is currently implementing
interpretative computer simulations of the two MSC designs
on the CDC 7600. The architectures and repertoires are
being evaluated, along with an intensive study of the
fault-tolerance features. Delivery of the ICSs and a study
report are due in July.

1.7. BIBLIOGRAPHY: H. Hecht and L. A. Fry, "Fault-
Tolerance in the Modular Spacecraft Computer," presented at
the 6th International Hawaii Conference, 9-11 January 1973.

2.  MOTIVATION 

2.1.  PURPOSE:  Support  of  all  satellite  data  processing 
requirements 

2.2.  PHYSICAL  ENVIRONMENT:  In  satellite 

2.3. COMPUTING ENVIRONMENT: Hardwired to environment

2.4. COMPUTING OBJECTIVES: About 200K operations per sec.

2.5. RELIABILITY OBJECTIVES: Nominal probability of
survival at the end of five-year life of 0.95; variability
achieved by adjusting the number of spares carried.

2.6. DYNAMIC VARIABILITY: Essentially no variability

2.7. PENALTIES: Loss of major satellite functions

2.8. CONSTRAINTS: 75 pounds and 30 watts

3.  DESCRIPTION 

3.1.  ARCHITECTURE 

3.1.1. CONFIGURATIONS

3.1.1.1. INTERCONNECTIVITY: Both designs are
bus-oriented. Raytheon uses eight general registers;
Ultrasystems uses a conventional AC/MQ design.

3.1.1.2. RANGE: Single processors. Memory is modular in
4K increments, up to 64K 32-bit words. I/O is variable, to
suit specific real-time applications.

3. 1.1. 3.  CAPABILITY:  Roughly  comparable  to  a  3t.°/^i).  500K 
fixed-point ADDs/sec; 200K floating-point ADDs/sec.

3.1.2.  EXECUTIVE 

3. 1.2.1.  MODES:  Interruptible  but  not  a  true 
multiprocessor 

3.1.2.2. SOFTWARE: Not yet developed. Will have a
real-time operating system, including fault-recovery routines.

3.2.  FAULT-TOLERANCE 

3.2.1. FAULTS TOLERATED: Transient and permanent, all
logic types. Also can tolerate some catastrophic faults.

3.2.2. FAULTS NOT TOLERATED: Faults resulting from major
physical damage.

3.2.3. TECHNIQUES: Replication; coding; repetition and
rollback; and configuration. Techniques used statically
and dynamically.

3.3.  NOVELTY:  Extensive  dynamic  redundancy 

3.4.  INFLUENCES:  Not  available 

3.5. HARD-CORE: Configuration Control Unit is
triply-modular-redundant, controlling all retries and most
reconfigurations.

4. JUSTIFICATION: The failure probability of
non-fault-tolerant computers is too high to permit their
use as central data processors on long-life spacecraft.
Consider a computer using the equivalent of 2500 electronic
parts, each of very high reliability such that the part
failure rate is 10E-8 per hour. Then the computer failure
rate is 25 x 10E-6 per hour. For an exponential failure
distribution, the five-year reliability is 0.37, the
reciprocal of e: e to the power -(40,000 x 25 x 10E-6).
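
The arithmetic in this justification can be checked with a short
sketch. It reads the report's "10E-8" as ten to the power minus
eight (which is what makes 2500 parts come to 25 x 10E-6 per
hour) and uses the report's round figure of 40,000 hours for
five years; the function name is hypothetical.

```python
import math


def mission_reliability(parts, part_failure_rate_per_hour, mission_hours):
    """Reliability under an exponential failure distribution.

    The computer fails if any part fails, so the part failure rates
    add: system_rate = parts * part_failure_rate_per_hour.
    """
    system_rate = parts * part_failure_rate_per_hour
    return math.exp(-system_rate * mission_hours)


# 2500 parts at 1e-8 failures per hour, over roughly 40,000 hours.
r = mission_reliability(2500, 1e-8, 40_000)
print(round(r, 2))  # 0.37
```

Five calendar years are closer to 43,800 hours, which would give
a slightly lower figure (about 0.33); the conclusion that an
unprotected computer falls far short of the 0.95 objective is
unchanged.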

5.  CONCLUSIONS 

5.1. STATUS: Performing interpretive simulation

5.2.  EXPERIENCE:  Architecture  very  suitable  for  intended 
application. 

5.3. FUTURE: Not available

5.4. ADVANCES: Practical design under weight and power
constraints.

6. COMMENTS: Raytheon began its design by using
duplication as a main approach, while Ultrasystems used
arithmetic coding. Neither approach was entirely
satisfactory, and it was very enlightening to observe the
two designs converge in the course of several iterations.
Currently, both employ a balanced mixture of duplication,
coding, and TMR. These design studies also demonstrated
the impracticality of the Reed and Brimley approach. Both
contractors initially broke up the computer into small
modules but found that the switching overhead and attendant
complications overshadowed the theoretical reliability
improvement. Larger modules are now used, with the
computer-on-a-chip in view. An exception was the memory
module, where a few spare bit lines seem useful.

SURVEY OF FAULT-TOLERANT COMPUTING SYSTEMS

Severo M. Ornstein, Bolt Beranek & Newman, Inc.
Cambridge, Mass. 02138, May 1973


1.  IDENTIFICATION 

1.1. NAME: High Speed Modular IMP (for the ARPANET)

1.2. RESPONSIBILITY: Bolt Beranek & Newman

1.3. SUPPORT: ARPA

1.4. PARTICIPANTS: Frank Heart, Severo Ornstein, William
Crowther, Benjamin Barker, Anthony Michel, Michael Kraley,
Martin Thrope, all from BBN.

1.5. START: July 1971

1.6. COMPLETION: Prototype summer 1973

1.7. BIBLIOGRAPHY: F. E. Heart, A New Minicomputer/
Multiprocessor for the ARPA Network, Proceedings of the
National Computer Conference, New York, N.Y., June 1973.


2.  MOTIVATION 

Stor*  4  Forward  Message  Processor— High  Spa  ad 
IMP  Modular  Vera  ion 

2.2.  ENVIRONMENT:  Ground-baaad.  Remote  diagnosis,  reetart. 

2.3.  COMPUTING  OBJECTIVES:  A  variable  sized  nodal  alawnt 
in  a  nationwide  (and  eoae  overaeaa)  computer  network. 

2.4. COMPUTING OBJECTIVES: Throughput capability of about
10 Megabits of traffic. Computing power to be 10 times
that of a standard mini (such as the Honeywell 516).

2.5. RELIABILITY OBJECTIVES: Machine must substantially
improve on the approximately 1% down time of present
version. Machine should run 24 hours a day year-round.

2.6. DYNAMIC VARIABILITY: We hope that the design will
embody soft failures wherein bandwidth capability will
degrade with failure but no functions will be totally lost.

2.7. PENALTIES: Reduction in communication facilities in a
net. Multiple failures can cause loss of communication to
certain nodes.


2.8. CONSTRAINTS: No explicit constraints -- goal is to have
a few racks in size and cost of about $100,000.

2.9. TRADEOFFS: Cost and everything else.


3.  DESCRIPTION 

3.1.1.1. INTERCONNECTIVITY: See Figure.

3.1.1.2. RANGE: Smallest is single processor single bus
system with a single logical memory. We do not understand
maximum size constraint as a number of physical and
engineering problems (power, cooling, rack space, cabling)
[limit] the size before logical boundaries are reached. We
are building a 14 processor prototype and expect that
systems of twice that size are not much harder.


3.1.1.3. CAPABILITY: That of a single Lockheed SUE (a small
modern 16 bit mini), 200,000 Adds/sec


3.4. INFLUENCES: Macromodular project at Washington Univ.

3.5. HARD-CORE: We have tried to avoid this concept in our
system wherever we could. We hope that it is in this very
avoidance that we may improve reliability (see 3.2.1).

4.  JUSTIFICATION 

4.1. RELIABILITY EVALUATION: We believe that in a new
system of this sort it is difficult if not impossible to
make meaningful prognostications of reliability. We
believe our overall system design is prone to reliability
if the basic parts are themselves reasonably reliable.
Only after the system has been running for a year or two
will we begin to understand what its real reliability is.

4.2. COMPLETENESS OF EVALUATION: Not particularly with
regard to part failures; conceptually, we believe it is
quite complete.

4.3. OVERHEAD: Impossible to estimate since this was not a
primary goal at the outset. The original goal was high
bandwidth and the scheme we chose simply led naturally to
what seemed a very reliable looking structure. We have
added relatively little (10%) specifically for reliability.
We could add more and (hopefully) improve the reliability
more as our structure is modular and expandable.

4.4. APPLICABILITY: We have only begun to investigate these
possibilities. We hope there will be many. Among these we
see real time signal processing and some specialized
multi-user applications.

4.5. EXTENDABILITY: The system is designed to be generally
extendable. That is one of its main points. Certain
boundaries are reached where the next step in expansion is
more costly than prior steps. We do not know where hard
limits will appear. We believe they will, for some time,
be of an "engineering" rather than "logical" nature.

4.6. CRITICALITIES: Fairly well matched. Since goal was
for variability the question is not too meaningful.
Multiprocessing was not a goal; it was, for us, a means.
Hardware choice was for suitability and convenience.

4.7. IMPLICATIONS: At present the design is based on the
Lockheed SUE bus structure (slightly modified). It could
have been based on some other computer, but less easily
and at greater cost. Until or unless we switch, this means
that all units in the system follow the SUE bus discipline.
The overall design was conceived for problems that could be
broken conveniently into parallel executable tiny tasks.
It achieves speed and power by such parallelism.


3.1.2  EXECUTIVE 

3.1.2.1. MODES: Designed for parallel task execution of
specially coded real-time problems. Parallelism is not
decided upon in advance but is provided for. All
processors can perform all tasks and adjust to current
work load.

3.1.2.2. SOFTWARE: Split into tiny (300 microsecond) tasks
which are queued with the aid of special hardware (which
itself is replicated for reliability). All processors can
perform all jobs.
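
The queueing arrangement described in 3.1.2.2 can be illustrated with a small software sketch. It is only an analogy: the survey says the queue is kept with the aid of special hardware, whereas here an ordinary in-memory queue and threads stand in for it, and the name run_workers and the task format are hypothetical. The point shown is that identical workers pull whatever task is next, so removing workers reduces bandwidth but not function.

```python
import queue
import threading


def run_workers(tasks, n_workers):
    """Run every (task_id, fn) pair on whichever worker is free.

    Returns a dict mapping task_id to the id of the worker that ran it.
    Workers are interchangeable: none is specialized for any task.
    """
    pending = queue.Queue()
    for task_id, fn in tasks:
        pending.put((task_id, fn))

    done = {}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                task_id, fn = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is finished
            fn()
            with lock:
                done[task_id] = worker_id

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Calling run_workers with 14 workers or with a single worker completes the same set of tasks; only the elapsed time differs.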

3.2.  FAULT  TOLERANCE 

3.2.1. FAULTS TOLERATED: We believe that, short of system
power failure, any one piece of the system can fail without
loss of function but with loss of bandwidth capability.

3.2.2. FAULTS NOT TOLERATED: Malicious manual interference,
systemic power failure, etc. Little protection against
software faults included since the program is a dedicated
real time program not subject to the vagaries of "users".

3.2.3. TECHNIQUES: Redundancy of parts and nonspecialization
of processors. Parts connected in a network so that
communication paths don't force specialization, e.g., I/O
devices connect to two busses -- each of which can be reached
by any of k processors. Power is distributed as 110 AC and
power supplies are modular -- i.e., each unit has its own DC
supply with it -- also its own cooling. The system requires
each piece to perform certain tests periodically and one of
the tasks required of some randomly selected free processor
is to check up on how everyone else is doing. Modules can
disconnect one another from the system if failing operation
is detected but protection is provided to avoid inadvertent
decoupling of a good unit.
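
One way to picture the protection against inadvertent decoupling is a quorum rule: a unit is disconnected only when at least two other units independently report it as failing. The survey does not say how the protection is actually provided, so the quorum rule, the function name amputation_candidates, and the report format below are all assumptions made for illustration.

```python
def amputation_candidates(reports, quorum=2):
    """Return the units that enough distinct checkers consider failed.

    reports maps a checker's name to the set of units it believes are
    failing. A checker's opinion about itself is ignored, and a unit is
    a candidate only if at least `quorum` other units accuse it, so one
    faulty checker cannot disconnect a good unit on its own.
    """
    accusers = {}
    for checker, suspects in reports.items():
        for unit in suspects:
            if unit != checker:
                accusers.setdefault(unit, set()).add(checker)
    return {unit for unit, who in accusers.items() if len(who) >= quorum}
```

With reports {"P1": {"P3"}, "P2": {"P3"}, "P3": {"P1"}}, only P3 is returned: the failing P3 accuses P1, but a single accusation is below the quorum.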

3.3. NOVELTY: We do not know of a similar system of a
collection of task oriented "workers" sharing
responsibility not only for the routine workloads (with
variations) but also for self test and, if appropriate,
amputation.


5.  CONCLUSIONS 

5.1. STATUS: Prototype nearing completion.

5.2. EXPERIENCE: It is hard to build such systems.

5.3. FUTURE: This IMP will be incorporated into the ARPA
network in various sized configurations.
SURVEY ON FAULT-TOLERANT COMPUTING SYSTEMS
W. C. Carter

IBM Thomas J. Watson Research Center
Yorktown Heights NY 10598

1. IDENTIFICATION

1.1. NAME: I am reporting mainly on a long-term research
effort in techniques for fault-tolerant computer
architecture. The relevant prior publications have used,
for example, the terms "modular architecture",
"self-repairing computers", "dynamic checking", "fault
diagnosis", "stand-by sparing" or "dynamic recovery" in
the titles and the authors have been some subset of the
participants named in 1.4. For present purposes I will
talk about a paper Modular Digital Computer system called
MDC whose principal properties will be specified later.
For reality, some requirements will be imposed which have
nothing to do with fault tolerance per se. This system
does not really exist, and will not exist, but is
specified to provide a focus for our fault tolerant
computing research.

1.2 RESPONSIBILITY: IBM Research.

1.3 SUPPORT: Support has come from IBM, U. S. Air Force
and NASA.

1.4 PARTICIPANTS: W. G. Bouricius, W. C. Carter, E. P.
Hsieh, D. C. Jessep, Jr., G. R. Putzolu, J. P. Roth, P. R.
Schneider, C. J. Tan, A. B. Wadia.

1.5 START: Formal initiation occurred in March, 1966.

1.6 COMPLETION: Open ended. No end item is scheduled.

1.7 BIBLIOGRAPHY:

* Roth, J. P., "Diagnosis of automata failures: a calculus
and a method", IBM Journal, vol. 10, 4, 1966.

* Bouricius, W. G., Hsieh, E. P., Putzolu, G. R., Roth,
J. P., Schneider, P. R., Tan, C. J., "Algorithms for
detection of faults in logic circuits", IEEE TC, Vol.
C-20, Nov. 1971.

* Bouricius, W. G., Carter, W. C. and Schneider, P. R.,
"Reliability modeling techniques and tradeoff studies for
self-repairing computers", ACM National Conference, San
Francisco, California, August, 1969.

* Bouricius, W. G., Carter, W. C., Roth, J. P. and
Schneider, P. R., "Investigations in the design of an
automatically repaired computer", Paper Number 6.4,
Conference Digest of the First Annual IEEE Computer
Conference, Chicago, Illinois, September 6-8, 1968.

* Carter, W. C. and Schneider, P. R., "Design of
dynamically checked computers", IFIPS, Edinburgh, Scotland,
August, 1968.

* Carter, W. C., Jessep, D. C., Wadia, A. B., "Error-free
decoding for failure-tolerant memories", 1970 IEEE
Computer Conference, Washington, D. C., June, 1970, pp.
229-239.
2. MOTIVATION

2.1 PURPOSE: Real time control, data acquisition and data
management.

2.2 PHYSICAL ENVIRONMENT: Aerospace applications have
predominated in specific design decisions. Modularity
should ensure wide applicability.


2.3 COMPUTING ENVIRONMENT: The MDC is planned to be able
to run the gamut from being insulated from human control,
serving a variety of sensors and effectors, to being able
to accept ground-based human directed control.

2.4 COMPUTING OBJECTIVES: Predicted configuration
scaleability primarily under internal control including
systems which are fault tolerant by masking redundancy, by
stand-by redundancy, or by software checks; systems whose
use of power is variable (but whose thruput is affected);
and systems operating in parallel. The major objective is
to provide means for meeting various requirements with a
high degree of confidence.

2.5 RELIABILITY OBJECTIVES: The system is to be designed
to meet varying specific mission reliability objectives
with a high degree of certainty. Examples are survival
for n years with a probability p; "fail operational, fail
operational, fail safe"; or reliability variable with
mission task.

2.6 DYNAMIC VARIABILITY: As stated above, dynamic
variation of system parameters such as performance,
reliability and power consumption with confidence in the
design as a major objective.

2.7 PENALTIES: Variable with mission, ranging from loss
of human life through expensive flight hardware to
abortion of flight objectives.

2.8 CONSTRAINTS: Hardware must be designed to fit weight,
power and size requirements, yet able to have thruput
compatible with mission requirements and to support the
software necessary for reasonable programming effort per
mission.

2.9  TRADEOFFS:  Hardware  efficiency  and  potential  thruput 
are traded for 1) system reliability as defined per
mission  phase;  2)  simplification  of  recovery  process  and 
other  basic  executive  functions;  3)  high  malfunction 
coverage  and  design  certification;  4)  ease  of  program 
validation;  5)  convenience  of  programming  and  ease  of 
diagnosis  for  external  equipment;  6)  system  flexibility. 


3. DESCRIPTION

3.1 ARCHITECTURE

3.1.1. CONFIGURATIONS

3.1.1.1. INTERCONNECTIVITY: The basic uniprocessor
configuration consists of partitioned computer subunits
attached to several busses. The basic subunits are (see
attached rough diagram): ALU, Scratch and Program Control
Unit, Bus Control, I/O Processor and Recovery Control
Unit. The bus orientation remains, but the units may be
modified (microprogrammed) for varying missions. The
system consists of replicas of the basic subunits, with
configuration control governed by the RCU and Executive
Program. A major problem is the interface design to meet
the constraints of fault tolerance, long life, and varying
modes of operation. The memory is encoded with a
b-adjacent error correcting code and spare b wide subunits
per basic module.
3.1.1.2. RANGE: The range of the system is not frozen in
the architectural concept. After four processors the law
of diminishing returns sets in sharply and further
partitioning may well be a better bet for long life. The
memory will consist of modules, each module consisting of
b-wide units with b-adjacent coding and spare b-width
units. The upper limit depends upon the hardware
available, but hardware does not appear to be critical.

3.1.1.3. CAPABILITY: The order of 10E5 to 10E6 additions
per second per basic system with a minimum of 256K-512K
32-bit words of memory. I/O will be handled by up to 4
16-bit parallel channels with 50,000 transfers per second
simultaneously on one input and one output channel. The
I/O processor will handle the details of I/O control under
direction from the processor Executive.

3.1.2 EXECUTIVE: The standard executive control
(allocation, scheduling, dispatching, I/O) will be
achieved by replicated software routines. These tasks
have not been studied much.

3.1.2.1. MODES OF OPERATION: Each processor is
multiprogrammable. System operation includes fault
masking, multiprocessing with hardware fault detection and
multiprocessing with software analysis. The mode of
operation of most concern is that of recovery initiation,
the interaction of the recovery and error analysis
programs of the executive and the RCU. Recovery and audit
programs always run background whether the system is in
fault masking, fault detection or software analysis modes.

3.1.2.2. SOFTWARE ORGANIZATION: The system software will
be distributed among the processors and analyzed by audit
routines for early detection of errors.

3.2.  FAULT  TOLERANCE 

3.2.1 FAULTS TOLERATED: In the error-masking mode, any
number of faults which affect only one partitioned
sub-unit can be tolerated. The system handles transient
faults with instruction retry or permanent faults with
hardware controlled reconfiguration. The cause is
irrelevant as long as the interface detects disagreement.
The disagreement circuits are self-checking so faults in
them are detected. Initially the same malfunction in
three units is necessary to defeat the system. After
reconfigurations two faulty units may escape detection.
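
The masking behavior described above can be illustrated with a simple majority vote over the outputs of replicated units. This is a sketch of the general technique only: the survey does not describe the MDC interface circuits in detail, and the function name vote and its return convention are hypothetical.

```python
from collections import Counter


def vote(outputs):
    """Majority-vote the outputs of replicated units.

    Returns (value, disagreeing), where value is the strict-majority
    output (or None if no strict majority exists) and disagreeing lists
    the indices of units whose output differs from it.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No strict majority: the error is detected but cannot be masked.
        return None, list(range(len(outputs)))
    return value, [i for i, out in enumerate(outputs) if out != value]
```

With four copies, vote([4, 4, 4, 9]) masks the single bad unit and reports index 3. Two identical wrong outputs, as in vote([9, 9, 4, 4]), leave no majority and are detected, while three identical wrong outputs, vote([9, 9, 9, 4]), outvote the one correct unit, which is the case the text says is needed to defeat the system initially.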

In the error detection mode, faults causing a single
subunit to be in error are detected. At this point the
same errors in two units will be undetected. Diagnosis
and software recovery is necessary for continuation.

Faults detected by software checks are detected and
recovery should follow in the unchecked multiprocessing
mode. Faulty software may be detected by the RCU time-out
tests and system evaluation procedures.

3.2.2. TECHNIQUES: In hardware fault tolerant mode the
system should FO - FO - FS for each one of the partitions
of the system if four copies of the basic computer are
used. Diagnosis can continue the computation with one
partition unchecked. Detailed fault analysis must be
performed to validate such goals. In hardware fault
detection mode the system should run at least two
multiprocessor hardware checked systems. A fault would be
detected, and diagnoses would allow continuation with one
partition unchecked by hardware. Achieving such
hardware/firmware/diagnosis goals depends upon the
development of many tools of fault analysis. The memory
encoding is b-adjacent multiple error correcting and/or
multiple b-adjacent error detecting. The codes used are
variants of Reed-Solomon codes with combinational self-
checking translators which pass only correct code words.
Standard single instruction retry is available.

Microdiagnostics under executive program control with
program variable input patterns will be used for fault
analysis. The executive software will use the standard
fault tolerant techniques - two way lists with pointer
verification before proceeding, stored data and programs
will be tagged with redundant identification, read only
programs will allow simple updating etc. Rollback and
restart will be used for multi-processing with hardware or
software error detection. The RCU monitors constantly for
catastrophic faults - those not detected by the hardware
and software tests. The standard time-out tests and
system performance evaluation routines are run and
controlled by the RCU. Power is conserved under program
control by forcing n cycles between memory accesses,
imposed by a counter with program changeable contents.
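
The phrase "two way lists with pointer verification before proceeding" can be illustrated as follows. This is a minimal sketch of the general idea, not the MDC executive's data structures: the class Node and the functions append and checked_walk are hypothetical names. Before following a forward pointer, the walk confirms that the successor's backward pointer refers back to the current node.

```python
class Node:
    """One element of a two-way (doubly linked) list."""

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


def append(tail, value):
    """Link a new node after tail (tail may be None) and return it."""
    node = Node(value)
    node.prev = tail
    if tail is not None:
        tail.next = node
    return node


def checked_walk(head):
    """Return the list's values, verifying each link before following it.

    Raises RuntimeError if a successor's back pointer does not point to
    the node we came from, which indicates a damaged pointer.
    """
    values = []
    node = head
    while node is not None:
        nxt = node.next
        if nxt is not None and nxt.prev is not node:
            raise RuntimeError(
                "pointer verification failed after %r" % (node.value,))
        values.append(node.value)
        node = nxt
    return values
```

The redundancy costs one extra pointer per element, and a single damaged pointer is caught at the step where the forward and backward links disagree rather than propagating further.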


3.3 NOVELTY: Reconfiguration under hardware control in
fault masking mode. Choice of computer fault masking,
multiprocessing with fault masking and various forms of
detection, multiprocessing with hardware error detection
by comparison, multiprocessing with software error
detection. Storage reliability by b-adjacent multiple
error detecting and correcting codes. Self checking
memory translators, checking circuits, and error-analysis
circuits. Use of power under program control.

3.4 INFLUENCES: 1. JPL STAR - the total effort; 2. SRI,
Techniques for the Realization of Ultra-Reliable
Spaceborne Computers; 3. MIT-Draper Lab. for spaceborne
multiprocessors; 4. Rapid emergence of LSI for feasibility
of much redundant hardware.

3.5 HARD-CORE: Assuming that hard core means hardware,
redundant or not, whose failure will produce undetected
errors, there is no such hardware in this system.
Hopefully, the software can be validated so that equal
claims can be made for it.


4. JUSTIFICATION FOR THE SYSTEM

4.1 RELIABILITY EVALUATION: Architectural reliability
evaluation by interactive program using exponential
failure assumption for the units. Determination of
component failure rates by analysis based upon previous
data, experience, and analysis. Logic fault analysis of
circuits in design stage by interactive fault simulation
programs. Diagnostic pattern evaluation by simulation
programs. Memory failure predictions by careful
probabilistic fault analysis to predict error patterns,
programmed computation of the circuit failure constants,
programmed evaluation of reliability. Programmed
analysis of RCU functions. Theoretical analysis of
design, with hardware and software, in complicated
situations (guided by simulation).
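
As an illustration of what an exponential failure assumption implies, the sketch below compares a single unit with a triplicated, majority-voted arrangement. The formulas are the standard textbook ones, assuming independent failures at a constant rate and a perfect voter; they are not taken from the IBM evaluation programs, and the function names are hypothetical.

```python
import math


def simplex_reliability(failure_rate, t):
    """Probability that one unit with a constant failure rate survives to t."""
    return math.exp(-failure_rate * t)


def tmr_reliability(failure_rate, t):
    """Probability that at least two of three identical units survive to t.

    Assumes independent failures and a perfect voter:
    R_tmr = 3 R^2 - 2 R^3, with R the single-unit reliability.
    """
    r = simplex_reliability(failure_rate, t)
    return 3 * r**2 - 2 * r**3
```

Under these assumptions triplication helps only while the single-unit reliability is above 0.5, that is, for mission times shorter than ln 2 divided by the failure rate; beyond that point the triplicated arrangement is less reliable than one unit, which is one reason sparing and reconfiguration are combined with masking.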

4.2 COMPLETENESS OF EVALUATION: Major unsolved problem.

4.3 OVERHEAD: Variable. In the processors about a 3 1/2:1
logic count penalty is paid (the cost is much less).
In the memory about a 3:2 storage penalty is paid. In the
software the cost is unknown, but considerable.

4.4 APPLICABILITY: The concepts can be used elsewhere;
the system is oriented toward space and extremely high
reliability applications.

4.5 EXTENDABILITY: This computer is too reliable to fit
into most other systems. For extension some of the fault
tolerant techniques in the computer must be eased for
better total system balance.

4.6 CRITICALITIES: Multitasking, as with all Executive-
controlled recovery systems, is critical, achieved here
with multiprogramming. Multiprocessing is an imposed
condition, but small system simplifications would result
if this condition were relaxed. Design validation tools
are critical.

4.7 IMPLICATIONS: Architects must perform automated error
and recovery analysis while doing system specification.
Human analysis is too fallible. Hardware designers must
have and use tools to do fault analysis as they design.
After the first pass they must do design validation and
iterate. Software designers must participate in the
initial decisions, must produce more techniques for
producing self-checking programs, and must produce the
tools for program validation. Applications programmers
must validate their programs (top down programming
techniques will help), and must follow system rules (not
so far known).


5.  CONCLUSIONS 

5.1  STATUS:  This  system  is  the  collection  of  a  group  of 
ideas from a research project.

5.2  EXPERIENCE:  None  to  report  to  date. 

5.3 FUTURE: The system will be pursued only in a modified
form as a paper study only.

5.4 ADVANCES: The problems of validation - hardware and
software - will provide many a bottleneck for fault
tolerant computing. The basic problem of definition of
fault tolerant computing will be with us - do we consider
any algorithm, procedure?
SURVEY ON FAULT-TOLERANT COMPUTER SYSTEMS

Albert L. Hopkins, Jr., MIT Draper Laboratory
Cambridge, Mass. 02139, Feb 1973


1.  IDENTIFICATION 

1.1. NAME: I am reporting on a long-term development
effort which has been supported by different projects at
different times. The following titles have been used for
published reports:

* "A Fault-Tolerant Information Processing System for
Advanced Control, Guidance, and Navigation".

* "Space Transportation System Data Management System".

In addition, an experimental three-processor
three-scratchpad breadboard has been given the acronym
CERBERUS for the three-headed dog in classical mythology.
The acronym engendered the title: Controlled Error
Recovery Behavior Employing Redundant Use of Scratchpads.
In what follows, I use "the system" to mean the general
concept, rather than a specific hardware design.

1.2. RESPONSIBILITY: This work is in the Digital
Development Group of the Charles Stark Draper Laboratory,
a division of M.I.T.

1.3. SUPPORT SOURCES: So far all support has come from
the NASA Manned Spacecraft Center.

1.4. PARTICIPANTS: MIT and NASA/MSC.

1.5. START: Work in this area began in 1966.

1.6. COMPLETION: Open ended. No end item is scheduled.

1.7. BIBLIOGRAPHY:


2.4. COMPUTING OBJECTIVES FOR THE CENTRAL MULTIPROCESSOR:
Variable from the order of 10E5 (i.e., 10 to the 5) to the
order of 10E6 operations per second, with memory
capacities of from 2E14 to 2E17 words of main random
access memory. Input-output bandwidth 10E5 useful
bits/sec on a 10E6 pulse-per-second bus. Reaction time
order of 10 milliseconds.

2.5. RELIABILITY OBJECTIVES: Various types of objectives.
One example is airline applications where less than one
catastrophic system malfunction in 10E7 flights is sought.
Other objectives are stated in terms of the number of
individual malfunctions which can be tolerated in a
flight, such as "Fail operational, fail operational,
failsafe" (FO-FO-FS). The system is generally meant to be
used in very high reliability applications.

2.6. DYNAMIC VARIABILITY: Graceful degradation is
available as a means of exchanging performance for
reliability.

2.7. PENALTIES: In the Space Shuttle application, as in
possible aircraft applications, human life is concerned.
Other life-critical applications can be easily envisioned.

2.8. CONSTRAINTS: In Space Shuttle and aircraft,
approximately 2 cubic feet, 120 lb., 300 watts.
(Estimate for a central multiprocessor). Other
applications may be more or less severe.

2.9. TRADEOFFS: Hardware efficiency is traded for 1)
system reliability, 2) high malfunction coverage, 3) ease
of program verification, 4) system flexibility.

The number of faults tolerated is variable through a
combination of replication and sparing. Processors and
memories can be added (deleted) to increase (decrease)
processing and memory resources.


* R. L. Alonso, A. L. Hopkins, Jr., and H. A. Thaler,
"Design Criteria for a Spacecraft Computer", Spaceborne
Multiprocessing Seminar, pp. 23-28, NASA ERC, Boston
Museum of Science, Oct. 1966.

* R. L. Alonso, A. L. Hopkins, Jr., and H. A. Thaler, "A
Multiprocessing Structure", Digest of the First Annual
IEEE Computer Conf., pp. 56-59, Chicago, Sept. 1967.

* A. I. Green et al., "STS Data Management System
Design", MIT C.S. Draper Laboratory, Cambridge, Mass.,
Report E-2529, June 1970.

* A. L. Hopkins, Jr., "A Fault-Tolerant Information
Processing Concept for Space Vehicles", IEEE Trans.
Computers, Vol. C-20, pp. 1394-1403, Nov. 1971.


2.  MOTIVATION 

2.1.  PURPOSE:  Real  time  control,  data  acquisition  and 
data  management. 

2.2. PHYSICAL ENVIRONMENT: In principle it could be any,
but aerospace applications have predominated in design
decisions.

2.3. COMPUTING ENVIRONMENT: Systems considered so far are
envisioned as largely self-contained information
processing systems serving a variety of sensors and
effectors including human operators. Such systems would
be distributed, hierarchical and redundant. Central
fault-tolerant multiprocessors would communicate over
serial data buses to local processor complexes embedded in
subsystems of the total system. A principal application
considered for this approach was the Space Shuttle, where
the Orbiter would have one central multiprocessor with
adequate redundancy and spare hardware to be operational
after three malfunctions. Each subsystem or group of
identical subsystems would be served by single or
redundant local processors, as appropriate, to fulfill the
redundancy requirement for that subsystem or group.

The Booster stage of the Space Shuttle would, in this
concept, contain a system similar to that of the Orbiter,
capable of communicating with it by way of a serial bus
connecting the two central multiprocessors. All
communication between a central multiprocessor and its
local processors would be via a serial data bus.


3. DESCRIPTION OF THE SYSTEM

3.1. ARCHITECTURE

3.1.1. CONFIGURATIONS

3.1.1.1. INTERCONNECTIVITY: The system makes extensive
use of replication, and consequently connections have a
high cost. Serial and byte-serial buses are used between
basic units. Multiplexers are employed to prevent single
unit malfunctions from spreading to all copies of a
redundant bus. The canonical interconnection scheme is
shown in Figure 1.

3.1.1.2. RANGE: No range limits have been determined, but
the following numbers may be typical for an aerospace
application. There are two current competitive
conceptualizations of the system. These numbers represent
the newer and less well developed concept.

*  6 - Number of simultaneous job steps in process

*  3 - Degree of replication of each processor-scratchpad

*  3 - Number of spare processor-scratchpads

* 21 - Total processor-scratchpads = 6x3+3

*  4 - Number of independent memory blocks of 16K

*  3 - Degree of replication of each block

*  3 - Number of spare blocks

* 15 - Total memory block modules = 4x3+3

The number of processor-scratchpads and memory blocks can
be increased up to the practical bandwidth limit of the
processor-memory bus and the I/O bus.
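
The totals in the list above follow from one rule: each independent group is replicated, and a pool of spares is added. The short sketch below reproduces the arithmetic; the function name total_modules is hypothetical, and the reading of "16K" as 2 to the 14 words is an inference from the report's "2E14" notation, not a statement in the text.

```python
def total_modules(groups, replication, spares):
    """Modules needed for `groups` independent units, each replicated
    `replication` times, plus a shared pool of `spares`."""
    return groups * replication + spares


# Figures quoted in the survey for the newer concept.
processor_scratchpads = total_modules(groups=6, replication=3, spares=3)
memory_blocks = total_modules(groups=4, replication=3, spares=3)

# Assumed reading of the notation: 2E14 means 2 to the 14, i.e. 16K words.
words_per_block = 2 ** 14

print(processor_scratchpads, memory_blocks, words_per_block)
# prints: 21 15 16384
```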

3.1.1.3. CAPABILITY: The order of 10E5 to 10E6 additions
per second and the order of 2E14 words of memory. Three
processors would be the smallest "sensible" number.

3.1.2. EXECUTIVE

3.1.2.1. MODES OF OPERATION: All programs are segmented
into job steps which are dispatched by a floating form of
executive. Each job step occupies one processor full time
while it runs. Multiprocessing is the normal operating
mode. Multiprogramming of each processor is not
envisioned.

3.1.2.2. SOFTWARE ORGANIZATION: I/O processing is
quasi-dedicated to one processor triplet (i.e. it can float
but does so only when malfunction makes it necessary).
Executive, monitor, and reconfiguration programs are run
on an as-needed basis by each processor triplet as it
finishes a job step.
3.2. FAULT TOLERANCE

3.2.1. FAULTS TOLERATED: Individual units (e.g.,
processor, memory unit, multiplexer) can malfunction one
at a time with no restriction on what the nature of the
malfunction is. Errors are masked by the system until it
reconfigures itself to a fault-tolerant state.


3.2.2. FAULTS NOT TOLERATED: Certain malfunction pairs
which occur simultaneously or close together in time can
produce loss of data and may require a program restart.
Incorrect specifications or program malfunctions can
defeat the system. Systematic hardware malfunctions in
which the same malfunction occurs in two redundant units
can defeat the system.


3.2.3. TECHNIQUES: Two different concepts.

First concept: all processors are duplexed for detection.
All scratchpads are triplexed for masked dump capability.
Single instruction restart. Graceful degradation of
processor-scratchpad groups. Triplex memory units with
dedicated spares. Triplex buses with spares.
Multiplexers isolate buses from failed groups of units.


Second concept: processor-scratchpad units are organized
into groups of three under software control. Each looks
for disagreement. If disagreement occurs, continue
running to end of job step, then enter reconfiguration
program. Graceful degradation of individual
processor-scratchpad units (rather than groups of three
scratchpads and two processors as in first concept).
Triplex memory units with non-dedicated spares. Triplex
buses with spares. Multiplexers isolate buses from failed
individual units (rather than groups as in first concept).
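
The second concept's sequence (run the job step to its end, then reconfigure) can be sketched as below. This is an illustrative model only, with units represented as plain Python callables; the function name run_job_step, the use of a simple majority to pick the result, and the policy of dropping a failed unit when no spare remains are assumptions, not details given in the survey.

```python
from collections import Counter


def run_job_step(step_input, group, spares):
    """Run one job step on every unit of a group, then reconfigure.

    group is a list of callables standing in for processor-scratchpad
    units. The step runs to completion on all of them; only afterwards
    are disagreeing units replaced from the spares pool. If no spare
    is left, the failed unit is dropped and the group runs degraded.
    Returns (result, new_group, remaining_spares).
    """
    outputs = [unit(step_input) for unit in group]
    majority, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two units agree; the job step must be restarted")

    new_group = []
    remaining = list(spares)
    for unit, out in zip(group, outputs):
        if out == majority:
            new_group.append(unit)
        elif remaining:
            new_group.append(remaining.pop(0))
    return majority, new_group, remaining
```

Deferring reconfiguration to the job-step boundary is consistent with the stated absence of interrupts and rollback: the step's result is still obtained from the agreeing units, and the repair happens at a point where no step is in progress.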


In  both  concepts,  software  configuration  control  is  used, 
which  is  valid  as  long  as  a  working  processor  group, 

*•*00  group,  snd  bus-mult lplaxer  group  are  available. 
Multiplexers  participate  io  configuration  control. 

3.3.  NvVELTT:  Siogls  instruction  restart.  Abeeoca  of 
loterrupts  and  program  rollbacks.  Distributed  axmltor  and 
reconfiguration  functions.  Use  of  multiplexers  to  isolate 
bus  and  unit  malfunctions.  Fault-tolerant  clock. 
Hierarchical  system  with  fault  tolerance  extended  into 
subsystems. 

3.4. INFLUENCES: Rapid emergence of LSI memories and
processors has encouraged use of replication and
partitioning with simple, identical units. Apollo
Guidance Computer experience prompted elimination of
interrupts and rollback for the sake of program
verification. Carter and Bouricius for reliability
models. Avizienis for concepts of fault tolerance.

3.5. HARD CORE: Assuming that hard core means
non-redundant hardware, there is no hard core in this
system. Configuration control is a software function
using the available hardware to configure the system.


4.  JUSTIFICATION 

4.1. RELIABILITY EVALUATION: So far mostly geared toward
FO-FO-FS. Some probabilistic analysis. No reliability
projections as yet since hardware has not been selected
and failure rates are therefore not known.

4.2. COMPLETENESS OF EVALUATION: Hardware not selected,
hence failure rate not known.

4.3. OVERHEAD: About 80% of the system is devoted to the
achievement of fault tolerance.

4.4. APPLICABILITY: This concept is applicable to most
digital control environments, depending on the economics
of the application regarding fault tolerance.

4.5. EXTENDABILITY: Extendability probably does not
apply, since the system is still loosely specified.

4.6. CRITICALITIES: The system is most cost-effective
compared to other systems when the number of faults to be
tolerated is high and where ultra-high reliability is
sought. For single-fault tolerance and less high
reliability, the system configuration might be changed.

4.7. IMPLICATIONS: In an ultra-high reliability
application, specifications and programs must be proven to
be correct. In this system, applications programmers must
also segment their programs into short job steps.


5. CONCLUSIONS

5.1. STATUS: This is a research project with a
breadboard experimental unit almost completed.

5.2.  EXPERIENCE:  None  to  report  to  date. 

5.3. FUTURE: Some parts of the system still need to be
designed and prototyped. Experiments must be conducted on
a full-scale prototype system.

5.4. ADVANCES: The following will be beneficial.

•Demonstrated field experience with various fault-tolerant
concepts.

•Practical techniques for generating correct programs.

•Practical ways of verifying that a program is correct.

6.  COMMENTS 

The questionnaire was good in the sense of being thorough,
but in my haste to respond to it I wonder if I have
omitted significant material. An additional comment about
this system is that it has been configured around
integrated processors and memories which resemble those
that are available today. The hardware efficiency number
given in Section 4.3 is very misleading, because the cost
of the hardware can be the least important cost of the
system, if the hardware is conventional and not overly
expensive. This system is expected to save in costs of
system integration, program verification, and operational
reliability experience. These savings may be far in
excess of the hardware cost.

As an additional note, the replicated approach used here
gives coverage of 1.0 for single malfunctions. Most coded
approaches generally give lower coverage, difficult to
quantify, and often impossible to verify in the field.


P...Processor

S...Scratchpad memory
M...Memory module
X...Multiplexor

SSI...Subsystem Interface


SURVEY OF FAULT-TOLERANT COMPUTING SYSTEMS

W. L. Martin, Hughes Aircraft Company
Fullerton, California 92634, May, 1972

1.  IDENTIFICATION 

1.1. NAME: Automatically Reconfigurable Modular
Multiprocessor Systems (ARMMS)

1.2. RESPONSIBILITY: Astrionics Laboratory, Marshall
Space Flight Center, NASA

1.3. SUPPORT: Same as 1.2

1.4. PARTICIPANTS: The participating organizations
include NASA MSFC, Hughes (system design), M&S Computing,
Inc. (executive software under subcontract to Hughes);
Auburn University (executive control approaches under
contract to NASA). Principal participants by name are as
follows: NASA - Dr. J. B. White, Sherman Jobe; Hughes - W.
L. Martin; M&S - T. T. Schansman; Auburn - Dr. David
Irvin.

1.5. START: The date of conception was circa 1968 in a
concept document written by Dr. White. MSFC has been
developing technology under their Space Ultrareliable
Modular Computer (SUMC) program since shortly thereafter.
The system design effort being performed by Hughes was
solicited in May, 1971 with a contract in October, 1971.

1.6. COMPLETION: The Hughes system definition contract
will be completed in April, 1973. Construction of a
breadboard or prototype may follow with completion date
uncertain.

1.7. BIBLIOGRAPHY: Various planning documents have been
written at NASA. Dr. White may be contacted for these.

The Hughes effort is divided into three phases, with the
Phase I report released on April 13, 1972. It is titled
"Design of a Modular Digital Computer System", DRL i,
Phase I Report, Hughes Aircraft Company FR 72-11-A30. Two
other papers have been submitted for publication. Their
fate is uncertain as of yet, but interested parties may
obtain copies from W. L. Martin at Hughes. These are the
following:

•J. L. Bricker, A Unified Method for Analyzing Mission
Profile Reliability for Standby and Multiple Modular
Redundant Computing Systems which Allows for Degraded
Performance (submitted to the IEEE Transactions on
Reliability Theory).

•J. L. Bricker and W. L. Martin, Reliability of Modular
Computer Systems with Varying Configuration and Load
Requirements (submitted to 1972 IEEE Computer Society
Conference).


2. MOTIVATION

2.1. PURPOSE: ARMMS is to be applicable through
modularity to diverse types of space missions ranging from
launch vehicles, to space stations to deep space probes.

2.2. PHYSICAL ENVIRONMENT: Spaceborne

2.3. COMPUTING ENVIRONMENT: See 2.1, 2.2

2.4. COMPUTING OBJECTIVES: The motivating computing
objective is to be able to configure systems which are
fault tolerant through TMR or other redundant modes or to
use the modules in parallel for high computing capacity
and to be able to reconfigure from one type to the other
dynamically. Maximum capacity in a non-redundant mode is
to be "several million" additions per second.


2.5. RELIABILITY OBJECTIVES: One specific reliability
objective is that the probability of survival of at least
a simplex computer after 5 years should be at least 0.1*
(with no on-board maintenance). The overall intent,
however, is that the system should be able to be
configured to meet specific mission reliability objectives
whether they be stated in terms of maximum recovery time,
number of failures tolerated, etc.

2.6. DYNAMIC VARIABILITY: As noted in 2.4, dynamic
variability of configuration is one of the primary
motivations.

2.7. PENALTIES: See 2.1.

2.8. PHYSICAL CONSTRAINTS: There are no explicit physical
constraints except those implied by the nature of the
intended spaceborne application. However, an implicit
physical constraint is the difficulty of contriving an
approach to a large (by aerospace standards) computing
capability fault-tolerant design within the confines of
weight and power budgets which may prevail for
interplanetary missions.

2.9. TRADEOFFS: At the current stage of the design, there
are many critical tradeoffs yet to be made.

* For a computer which will be built after 1973, what
device complexity and failure rates should be assumed?
Almost all aspects of the design are critically affected
by this question. Some of the more crucial ones are the
maximum complexity of any module; the degree to which
processors must be sub-partitioned; the resulting cost in
switching hardware; the maximum number of replicates of
any one module type which must be accommodated; and the
complexity of the configuration control software.

* The basic ARMMS concept developed by NASA incorporates
a dedicated executive module rather than a floating
executive. Resulting tradeoffs include specific
definition of functions to be performed, specification of
status monitoring and reconfiguration parameters, and a
design approach which yields sufficiently high reliability
for the executive module.

* The system architecture is not yet defined in any
complete sense. Questions yet to be resolved include
specific definition of allowed modes of operation;
definition of the means of interconnecting the modules;
placement and use of voters; use of error-correcting codes
for memory data; maximum number of replicates per module
class; specific techniques for memory data protection; and
fault tolerance features within each module class. At
present, we are making tradeoffs based on two major
configuration alternatives. Although few tradeoff
conclusions have been reached, the predominating
evaluation criteria are almost certain to be the
following:

* Implementation feasibility - Any design feature which
does not seem to us to be feasible in any major sense
(e.g., pin count, excessive power, design cost) will be
rejected. We are not particularly interested in
developing new theories or techniques of fault-tolerant
computing but are very interested in developing a
much-needed testbed based on the research performed over
the last 5-10 years.

* Suitability to the multi-mode configuration
requirements - ARMMS is intended to be usable in
configurations ranging from a simplex computer to TMR with
standby spares. Any feature which imposes excessive
overhead cost for the benefit of one configuration at the
expense of others is suspect. For example, added hardware
per module for internal fault tolerance multiplies the
hardware penalty paid in TMR mode.

3. SYSTEM DESCRIPTION: As seen from the tradeoff
discussion above, no firm system description is possible
now. Therefore, the responses in this section are
necessarily brief and incomplete.

3.1.  ARCHITECTURE 

3.1.1. CONFIGURATION

3.1.1.1. INTERCONNECTIVITY: All processors, I/O, and
executive controller may access all of main memory (a
study of the desirability of identifying an additional
level of memory, cache or task oriented, was made, with a
negative conclusion reached). The most probable scheme is
a system of replicated busses with access control governed
by the executive module. The nature of spaceborne I/O
activity is biasing us toward a direct processor I/O data
path which can be used for transmitting short bursts of
data. The executive controller will monitor the other
system modules via a time-shared bus. This bus ordinarily
polls the modules in sequence but may be interrupted by
the processors on task completion or other time-critical
event. No direct interaction of modules of a given class
(e.g., processor-to-processor) is planned.

3.1.1.2. RANGE: The general approach to achieving the
large capacity as stated previously is to maximize the
individual processor performance so that throughput is not
dependent on a large number of parallel instruction
streams. (Three is a desirable upper limit.) The maximal
main memory capacity is to be large enough (e.g.,
256K-512K words) to support the high-throughput goals.
The word length is to be 32 bits as dictated by the choice
for the NASA SUMC processor. Cumulative I/O data rate
capability is to be 10 million bits per second. In all
cases, maximum number of modules per class (and the memory
module capacity) will be determined primarily by
reliability considerations. A least upper bound is 4 for
each class.

3.1.1.3. CAPABILITY: (See 2.4.)

3.2. FAULT TOLERANCE: (The system is still too much
conceptual to allow a decent response. All faults are to
be tolerated. None are to be not tolerated. All
techniques will be considered. Ask again in a year and
let's see how it turned out.)

3.3. NOVELTY: On the one hand, there's nothing that one
can point out as being fundamentally novel (this is true
of most machines, I think). On the other hand, there are
no machines that I know of that have successfully
implemented a variable redundancy approach such as is
being sought. The choice of a dedicated executive module
is the only deviation at the block diagram level from
other multiprocessors (but this module is a rather close
parallel of the TARP in STAR).

3.4. INFLUENCES: JPL STAR; NASA ERC Modular Computer;
NASA MSFC SUMC; IBM, "Architectural Study for a
Self-Repairing Computer"; SRI, "Techniques for the
Realization of Ultra-Reliable Spaceborne Computers".

3.5. HARD CORE: The executive module is hard-core. The
effect is to be minimized by simplifying the module as
much as possible and by internal redundancy (which may
ultimately result in replication).


4. JUSTIFICATION

4.1. RELIABILITY EVALUATION: To date, reliability has
been evaluated solely by analysis (as described in the two
papers mentioned in 1.7). Later in the effort, we expect
to extend the analysis to include coverage and switch
unreliability. We also expect to simulate the logical
performance of the intermodule switches and to simulate
the injection of faults.

4.2. COMPLETENESS OF EVALUATIONS: I'm not sure that I
understand the question. But whatever you mean by design
evaluation, I'm sure that I wish we had more time and
money to do it better.

4.3. OVERHEAD: Since the configuration is dynamic, the
percentages attributed to the achievement of
fault-tolerance also vary with time. An upper limit is
probably W; a lower limit is probably 20% (in cost,
logic, execution time, etc.).


4.4. APPLICABILITY: Applicability to other than space
applications is questionable.

4.5. EXTENDABILITY: I think that it is more likely that
the system design can usefully contract than that it can
be usefully extended.

4.6. CRITICALITIES: The major difficulty of the design
is the breadth of the goals. The critical problem is
therefore to find a set of design choices which complies
reasonably well with all the goals (e.g., we want high
speed and capability but require low weight and power).
However, I don't think that slight changes would
critically affect the design. (Also, as a side
observation, while one is in the midst of a system design,
all choices seem critical, don't they?)

4.7. IMPLICATIONS: (Let me plead that this question
seems too vague. I don't know where to start with a brief
response.)


5. CONCLUSIONS

5.1. STATUS: The status is sufficiently described by the
above comments, I think. In summation, we are about
one-third of the way through a system definition phase.

5.2. EXPERIENCE: It appears that component technology is
contributing more to the feasibility of highly reliable
machines than architecture concepts are. As recently as 2
or 3 years ago, gate failure rate of 10E-7 per hour seemed
optimistic. At present, gate failure rates of 10E-10 per
hour are credible for the space environment. On the other
hand, the assumption that dormant failure rates are a
small fraction of active failure rates appears
questionable. For a long-life machine in an unmanned
environment, these two factors are of major significance
to the system designer.
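
To give a feel for why a thousandfold change in gate
failure rate matters for a five-year mission, the sketch
below evaluates a simple exponential survival model for a
non-redundant machine. Several things here are assumptions
and not from the response: the gate count of 10,000 is
hypothetical, "10E-7" and "10E-10" are read as 1e-7 and
1e-10 failures per gate-hour, failures are taken as
independent with constant rate, and dormancy effects are
ignored.

```python
import math

HOURS_PER_YEAR = 8760


def survival_probability(gates, failure_rate_per_hour, years):
    """Probability that no gate fails during the mission.

    Assumes independent gates with a constant failure rate and no
    redundancy, so the machine fails as soon as any gate fails.
    """
    hours = years * HOURS_PER_YEAR
    return math.exp(-gates * failure_rate_per_hour * hours)


GATES = 10_000  # hypothetical machine size, for illustration only

older_rate = survival_probability(GATES, 1e-7, 5)
newer_rate = survival_probability(GATES, 1e-10, 5)

print(f"5-year survival at 1e-7 per gate-hour:  {older_rate:.3g}")
print(f"5-year survival at 1e-10 per gate-hour: {newer_rate:.3f}")
```

Under these assumptions the simplex machine is essentially
certain to fail at the older rate and survives with
probability of roughly 0.96 at the newer one, which is the
sense in which component technology can outweigh
architecture.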

5.3. FUTURE: There are two conflicting possible futures
of ARMMS. The pessimistic view is that it will go the way
of 10 [illegible] similar paper design efforts and will
die with only a monument to commemorate its non-existence.
The optimistic view is that it will appear sufficiently
[illegible] that NASA will continue its [illegible] and
eventually attach it to a mission. [Illegible] being
directed toward the optimistic alternative.

5.4. ADVANCES: I cannot add anything to the lists of
[illegible] and needed areas of [illegible] described in
[illegible] reports [illegible] contract NAS 12-JL. In
particular, I agree that there have been too few case
studies which can be evaluated.

A major practical advance which is needed is the
identification and exploitation of specific applications
in which fault-tolerant machines can be justified
economically. It is significant, I think, that the Bell
ESS-1 and System/360 FLT's, instruction retry, etc.,
represent the most extensive application of
fault-tolerance and diagnostic techniques. Both are in
areas where the payoff for high reliability is great.
Although aerospace applications have supported much of the
research in fault-tolerant machines, I am skeptical that
there is a sufficient amount of money there to lead to very
widespread results in fielded systems. The situation is
analogous to that which has existed for associative
processing for 10 years, in that the theory, concepts,
and techniques are often apparent but cost considerations
ultimately lead to more conventional choices.

Also,  I  wonder  if  "fault-tolerant  computing"  is  too  narrow 
a  view  and  that  many  of  the  basic  ideas  would  be 
applicable  to  a  discipline  of  "Fault-tolerant  systems". 
Perhaps there are other equally fertile, but less plowed,
fields to be conquered.


6. COMMENTS: (See 5.4)




SURVEY OF FAULT-TOLERANT COMPUTING SYSTEMS


J. S. Miller, Intermetrics, Inc., 701 Concord Avenue,
Cambridge, Massachusetts 02138, March 1973.


1.  IDENTIFICATION 

1.1. NAME: The system is referred to as either the
Intermetrics Multiprocessor, or the Space Station Computer.


1.2. RESPONSIBILITY: Intermetrics, Inc.

1.3. SUPPORT: NASA Manned Spacecraft Center, Houston, Tex.


1.4. PARTICIPANTS: J. S. Miller, W. H. Vandever, S. F.
Stanten, A. E. Avakian, and A. L. Kosmala.


1.5. START: The project began in June, 1969, and continued
for ten months. After a thirteen-month period of
inactivity, the design effort was resumed in May, 1971.


1.6. COMPLETION: The second phase of the design was
completed, and a report published, in September, 1972.


1.7. BIBLIOGRAPHY:

•J. S. Miller, D. J. Lickly, A. L. Kosmala, and J. A.
Saponaro, "Final Report - Multiprocessor Computer System
[illegible]", Cambridge, Mass., March, 1970.

•J. S. Miller, W. H. Vandever, S. F. Stanten, A. E.
Avakian, and A. L. Kosmala, "Final Report - Engineering
Study for the Functional Design of a Multiprocessor
[illegible]", Intermetrics, Inc., Cambridge, Mass.,
September, 1972.

•J. S. Miller and W. H. Vandever, "Design Features of an
Aerospace Multiprocessor", International Workshop on
Computer Architecture, June 26-28, 1973, Grenoble, France.


2.  MOTIVATION 

2.1. PURPOSE: The system is oriented towards the
general-purpose computational requirements of a manned,
orbiting space station of about the 1980 time period. Its
expected uses include real-time station control and data
acquisition functions, plus interactive and batch data
processing operations.


2.2. PHYSICAL ENVIRONMENT: The primary mission for which
the system was designed is a spaceborne one. However, the
[illegible] limited to the functional specifications,
and the physical environment considerations have had little
impact on the configuration.


2.3. COMPUTING ENVIRONMENT: All connected elements will
be aboard the space station. Components of the computer
will be interconnected by dedicated buses. Terminals,
displays, and sensors will be attached to the system by
means of a multiplexed data bus.


2.4. COMPUTING OBJECTIVES: The performance requirements
were rather soft. General objectives chosen for the system
were real-time response of 5 milliseconds or better, and an
equivalent processing rate of two million additions per
second for a three-processor configuration. Configuration
flexibility was an important objective of the design.


2.5. RELIABILITY OBJECTIVES: Because no hardware was
designed, no specific reliability requirements were
imposed. The use of the system as the central computer for
the space station life-support, trajectory, attitude, and
[illegible] functions places heavy emphasis upon
a design which allows continued operation, even if at
reduced performance, in the presence of faults. Although
it is expected that brief outages of the system will be
[illegible], our efforts have been directed at avoidance of
all single-point failure modes.


2.6. DYNAMIC VARIABILITY: The multiplicity of processors
is utilized to continue operation when processors fail.
Failures of processors thus reduce the peak processing
capacity. Similarly, a memory multiplexing scheme permits
program and data mobility to work around loss of memory
units. Increased multiplexing activity which follows
removal of memory units from service also degrades maximum
performance levels. Whether failures cause actual
degradation in service depends upon the amount of excess
capacity that was provided.


2.7. PENALTIES: Penalties from faulty operation are
difficult to assess this early in the space station
planning. However, loss of life is conceivably possible,
but failure to achieve mission objectives is a more likely
result of malperformance.


2.8. CONSTRAINTS: The constraints imposed by the space
station environment will influence hardware design, but
have not affected the functional design appreciably.


2.9. TRADEOFFS: Expected component and connection
reliability will drive decisions relative to the level of
[illegible] processor, memory, and bus capability to be
provided to achieve overall system availability goals.


3.  DESCRIPTION 

3.1.  ARCHITECTURE 

3.1.1. CONFIGURATIONS

3.1.1.1. INTERCONNECTIVITY: The basic configuration is
shown in Figure 1. The internal configuration of a
Processor unit, showing duplicated elements and
comparators, is given in Figure 2.


3.1.1.2. RANGE: Three to eight processors with at least as
many memory modules as processors, and preferably more to
diminish conflict frequency. Given an environment where
error detection was important but error recovery beyond
instruction retry was less important, as few as one of each
module can form a system.


3.1.1.3. CAPABILITY: The effective computing power of a
one-processor system is about 0.6 MIPS, or the approximate
equivalent of a 360/65.


3.1.2.  EXECUTIVE 

3.1.2.1. MODES OF OPERATION: Software execution is based
on a three-priority dispatching algorithm; of highest
priority are the functions which require real-time
response. These functions are kept short. Middle priority
processes may be longer, but are interrupted only by
real-time processes. Long batch-type processes are
assigned lowest priority, and effectively run in a
background mode.
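
The three-priority scheme can be sketched as a small ready
queue. This is only an illustration of the idea as stated
above, not the system's executive: the class Dispatcher,
its method names, and the process names in the usage note
are hypothetical, and first-in first-out ordering within a
priority level is an assumption.

```python
import heapq
from itertools import count

# Lower number means higher priority.
REAL_TIME, MIDDLE, BATCH = 0, 1, 2


class Dispatcher:
    """Ready queue for a three-priority dispatching discipline."""

    def __init__(self):
        self._ready = []
        self._arrival = count()  # tie-breaker: FIFO within a priority

    def submit(self, priority, name):
        """Make a process ready to run at the given priority."""
        heapq.heappush(self._ready, (priority, next(self._arrival), name))

    def next_process(self):
        """Remove and return the highest-priority ready process, or None."""
        if not self._ready:
            return None
        return heapq.heappop(self._ready)[2]

    def should_preempt(self, running_priority):
        """True if some ready process strictly outranks the running one."""
        return bool(self._ready) and self._ready[0][0] < running_priority
```

With one batch job, one middle-priority job, and one
real-time function ready, next_process returns the
real-time function first, and should_preempt(MIDDLE) is
true only while a real-time function is waiting, matching
the rule that middle-priority processes are interrupted
only by real-time ones.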


3.1.2.2. SOFTWARE ORGANIZATION: The system software can
run on any or multiple processors. Critical sections are
protected by interlocks to avoid disruptions due to
interference. [Illegible] and recovery software
is stored in duplicate, under the hardware-implemented
information protection scheme outlined below. Thus, single
faults, even those which disable an entire memory module,
cannot prevent access to or operation of the system.


3.2.  FAULT  TOLERANCE 

3.2.1. FAULTS TOLERATED: Expressly attempted in the design
is detection of and recovery from every single fault. In
this context, a second fault is one which occurs before the
recovery actions have been completed for the first
fault. Processors, memories, and buses may be removed from
operation as a result of faults. Performance capability is
correspondingly reduced.


3.2.2. FAULTS NOT TOLERATED: Synchronized double faults in
[illegible] are not tolerated. Memory faults affecting
information for which duplicate storage was not specified
are not tolerated, although the size of this set of
information is totally under user control.

Flaws in system software may not be tolerated, but faults
in applications software may adversely affect other
processes only through disruption of data they share.

The hardware supports detection of instruction loops,
subscript bounds violation, excessive processor time usage
and overtime inhibition of interruptions.


3.2.3. TECHNIQUES: Error detection and recovery are based
on redundancy of information and capability. Processor
units are comprised of duplicate elements, whose external
[illegible] compared. This approach maximizes the error
detection coverage, and minimizes the extra design needed
[illegible]. Memories are parity checked for [illegible],
and information whose loss cannot be tolerated
[illegible] in duplicate. Processor local storage (M1) is
all duplexed. Information in main memory (M2) is
[illegible] by [illegible]. A novel use of [illegible]
storage management allows such duplication to be
implemented entirely within the hardware. Software may be
written without need for consideration of the detection or
recovery problems. Duplicated buses support comparison
between independent copies for comprehensive error
detection in the cases where duplicate storage is
specified, and for error detection of transfers otherwise.
The implementation of duplication in M2 causes copies to be
kept in distinct units, so that even catastrophic failure
of an entire unit can be tolerated. The memory multiplexing
technique incorporated to reduce the total amount of M2
required for a given performance level allows program and
data segments to be moved to new locations when M2 failures
occur.
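
The combination of parity checking with duplicate copies
held in distinct memory units can be sketched as follows.
This is a software illustration only, under stated
assumptions: the response describes a hardware mechanism
whose details are not given here, and the names
MemoryFault, MemoryUnit, and DuplicatedStore, as well as
the single even-parity bit per word, are hypothetical.

```python
class MemoryFault(Exception):
    """Raised when a word cannot be read back reliably."""


def parity(word):
    """Even-parity bit of a non-negative integer word."""
    return bin(word).count("1") % 2


class MemoryUnit:
    """One memory unit storing each word together with its parity bit."""

    def __init__(self):
        self.cells = {}      # address -> (word, parity bit)
        self.failed = False  # models catastrophic failure of the unit

    def write(self, addr, word):
        self.cells[addr] = (word, parity(word))

    def read(self, addr):
        if self.failed:
            raise MemoryFault("unit failed")
        word, stored_parity = self.cells[addr]
        if parity(word) != stored_parity:
            raise MemoryFault("parity error")
        return word


class DuplicatedStore:
    """Keeps two copies of every word, each in a distinct unit."""

    def __init__(self, unit_a, unit_b):
        self.units = (unit_a, unit_b)

    def write(self, addr, word):
        for unit in self.units:
            unit.write(addr, word)

    def read(self, addr):
        good = []
        for unit in self.units:
            try:
                good.append(unit.read(addr))
            except MemoryFault:
                continue  # fall back on the other copy
        if not good:
            raise MemoryFault("both copies lost")
        if len(good) == 2 and good[0] != good[1]:
            raise MemoryFault("copies disagree")
        return good[0]
```

A read succeeds as long as one unit returns a word that
passes its parity check, so the loss of an entire unit is
survived, while two readable copies that disagree are
reported as a fault instead of being silently accepted.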

3.3. NOVELTY: With respect to fault-tolerance, a major
attempt has been made to isolate application software from
the effects of undetected hardware errors (by detecting as
many as feasible), and from the necessity to devote
explicit attention to survival following detected error (by
providing adequate hardware support). With respect to
architecture, a high-level instruction set has been
developed, tailored to the needs of high-order language
compilers, which are to be used to prepare all the software
executed in the system. A novel approach to a
descriptor-based, tagged-word design has been taken, in
which the Multics paging strategy has been applied for the
first time to variable-size pages (Burroughs' segments), a
unified stack data format has been utilized, and tag-bits
are incorporated into all [illegible]
at most one bit. Most words in the system need not expend
bits on tags. Variable-length instructions are used to
achieve maximum conciseness of program code.

3.4. INFLUENCES: The major influence on instruction-format
and stack-organized processing came from the
Burroughs B6700. The emphasis on hardware-implemented
error detection and recovery resulted from adverse
experience in attempting to provide these capabilities
through software in the Apollo on-board guidance computer
software development.

3.5. HARD-CORE: The only hard-core element in the system
is the I/O controller. Because no degraded level of I/O
capability seemed tolerable, the I/O controller is
implemented with high internal redundancy, so as to be
"failure proof".

4.  JUSTIFICATION 

4.1. RELIABILITY EVALUATION: Fault-tolerance is assessed
by thought experiments, rather than simulations or other
mechanical means. Reliability estimates cannot be made
until hardware design commences.

4.2. COMPLETENESS OF EVALUATION: Evaluation consists of
mental exercises. More rigorous exploration must
necessarily await hardware design.




4.3. OVERHEAD: Because the hardware implements the bulk of
the fault-tolerance provisions, very little price is paid
for this capability in terms of performance. Segregation of
processor local storage in M1 units to allow an alternate
processor to rescue a failed one at any point in the
execution of an instruction lengthens transit times
somewhat. The hardware cost consists of a factor of two in
processor costs, plus a bit for comparison circuits and
error-control logic. Memory costs are increased by the
amount that duplication of selected data requires extra
storage capacity.

4.4. APPLICABILITY: The system described is applicable to
any application where fault-tolerance is important. The
emphasis on real-time capability makes it especially
suitable for aircraft or process control applications.

4.5. EXTENDABILITY: Performance can be increased by use of
faster components; memory sizes may be increased, etc. It
is not believed that additional emphasis on fault-tolerance
would be particularly productive.

4.6. CRITICALITIES: The absence of a rigid set of
requirements has allowed a reasonable trade-off between
conflicting factors. The design has been driven strongly
only by the fault-tolerance requirements.

4.7. IMPLICATIONS: The system is designed around the
concept that all software for the machine will be produced
by correct compilers, which participate in the
implementation and enforcement of operating system and
programming ground rules. Use of high-level language for
all software development is increasingly recognized as a
valuable means of reducing software costs. However, the
further advantages which can be achieved by intimate
[illegible] and system requirements have not been exploited.

5.  CONCLUSIONS 

5.1. STATUS: The functional design is complete.

5.2. EXPERIENCE: The design experience has been completely
positive to date; the objectives and the approach continue
to appear valid.

5.3. FUTURE: The project is now dormant due to NASA
emphasis on the space shuttle program, which has caused
space station planning to be heavily curtailed. Other
sources for support of design continuation are being
sought.


SURVEY  OF  FAULT  TOLERANT  COMPUTING  SYSTEMS 


W. C. Ridgway III, Bell Labs, Madison NJ, July 1973

1. IDENTIFICATION

1.1 NAME: SAFEGUARD Data Processor

1.2 RESPONSIBILITY: Western Electric and BTL

1.3 SUPPORT: U. S. Army

1.4 PARTICIPANTS: Western Electric (Prime Contractor),
Bell Laboratories (responsible for: system design, design
of most digital equipment; and designed some system
software), Univac (designed central processor and some
diagnostic programs), IBM (designed some system software),
Lockheed (designed system core memories).

1.5 START: Design effort for the ABM System was started
in 1963.

1.6 COMPLETION: Hardware design is essentially complete;
software is in the final stages of development.

1.7 BIBLIOGRAPHY: (relevant to Fault Tolerance)

* D. B. Armstrong, "A Deductive Method for Simulating
Faults in Logic Circuits," IEEE Transactions on Computers,
Vol. C-21, No. 5, pp 464-71, May 1972.

* R. C. South, "A System for Simulating Faults in Large
Logic Circuits," (Talk given at Lehigh University Workshop
on Fault Detection and Diagnosis in Digital Systems,
December 8, 1971).

* J. R. Hahn, "A Maintenance Approach for a Large Computer
System," (Talk given at Lehigh University Workshop on Fault
Detection and Diagnosis in Digital Systems, December 8,
1971).


2.  MOTIVATION 

2.1 PURPOSE: Part of Missile Defense System

2.2 PHYSICAL ENVIRONMENT: Ground Based

2.3 COMPUTING ENVIRONMENT: Interactive - real time -
self-contained.

2.4 COMPUTING OBJECTIVES: To provide real-time detection,
discrimination, tracking, and guidance functions required
in a missile defense system.

2.5 RELIABILITY OBJECTIVES: To be able to withstand most
system faults and still perform the defense mission.

2.6 DYNAMIC VARIABILITY: Design allows graceful
degradation.

2.7 PENALTIES: Loss of defense capability.

2.8 CONSTRAINTS: Must operate in real-time in nuclear
environment (e.g., high nuclear radiation levels and ground
shock environment).

2.9 TRADEOFFS: Used (N + 1) redundancy and on-line
automatic diagnostics instead of full equipment redundancy.


3.  SYSTEM  DESCRIPTION 

3.1  ARCHITECTURE 

3.1.1  CONFIGURATIONS 

3.1.1.1 INTERCONNECTIVITY: See Figure 1.

3.1.1.2 RANGE: As noted in Figure 1.

3.1.1.3 CAPABILITY: Classified

3.1.2 EXECUTIVE

3.1.2.1 MODES: Independent processors are not
multiprogrammable; however, the collective system is
multiprogrammable and multiprocessing. There is no
master-slave relationship between processors, therefore no
hardcore (i.e., nonredundant critical hardware) exists.
Programs are segmented into tasks which are dispatched by a
scheduler.

Software Organization: I/O processing is performed
asynchronously by specific attached processors (known as
I/O controllers). Executive, monitors, diagnostics, and
other programs are run by the central processors as needed
once prior tasks are completed.


3.2  FAULT  TOLERANCE 

3.2.1 FAULTS TOLERATED: The system is designed to
withstand both transient and hard faults provided the
problems in 3.2.2 are not encountered.

3.2.2 FAULTS NOT TOLERATED: The system can meet
objectives unless either multiple faults occur
simultaneously in enough different equipment, so that a
viable system is not available, or transient faults (which
are not isolated) affect critical units at an abnormally
high rate.

3.2.3 TECHNIQUES: As shown in Figure 1, multiple units
exist for each major type of equipment (e.g., memory). One
of each type of these multiple units is included as a spare
which may be substituted for any faulty unit, i.e., there
are "n + 1" units of each type, where "n" is the number
required to perform the tactical mission.

* System reconfiguration of faulty units is controlled by
special redundant status units described in Section 3.3.

* A special maintenance subsystem is employed which uses a
separate maintenance path into all major registers (both
data and control) in the system (see Figure 2). This
subsystem is used to bootstrap the main system for normal
initialization, to detect faults (via routine diagnostics),
to isolate faults (using fault dictionaries), and to
perform system recovery by detecting any catastrophic
failure and reinitializing the system.

* Real-time diagnostics are periodically scheduled to
detect equipment failures.

* Error detection and response features are designed into
the normal system software. These features include
defensive programming (e.g., data reasonableness checks),
device managing (e.g., to isolate faulty units), initiating
system rollback to a previously determined state, and
calling for the maintenance subsystem to initiate complete
system recovery (i.e., rollback to initial state).

* Redundant equipment is used (during routine surveillance
periods) to play large scale system exercises against the
on-line system. These exercises are valuable in uncovering
errors, as well as maintaining the skill level of operators
(and thus minimizing the possibility of manual errors).

* Parity is used to check data across interfaces and in
memories.

* Error-detecting codes are used to check the transfer of
critical data between some subsystems.

* The system master clock employs triple-modular redundancy
to generate all major clock rates.
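
The "n + 1" sparing rule stated at the start of 3.2.3 can be
illustrated by a small sketch. The Python fragment below is
an illustration only, under stated assumptions: the class
name SparePool, the unit labels, and the method names are
hypothetical and are not taken from the SAFEGUARD design. It
shows that one spare per equipment type absorbs a single
faulty unit, while a second fault of the same type leaves
fewer than "n" units in service.

```python
class SparePool:
    """Hypothetical model of one equipment type with n active units and 1 spare."""

    def __init__(self, kind, n_required):
        self.kind = kind
        self.n_required = n_required
        # n + 1 units in total; any one of them can act as the spare.
        self.healthy = [f"{kind}-{i}" for i in range(n_required + 1)]
        self.faulty = []

    def report_fault(self, unit):
        """Take a faulty unit out of service; the spare covers for it."""
        if unit in self.healthy:
            self.healthy.remove(unit)
            self.faulty.append(unit)

    def mission_capable(self):
        """True while at least n healthy units remain."""
        return len(self.healthy) >= self.n_required


pool = SparePool("memory", n_required=4)
pool.report_fault("memory-2")
print(pool.mission_capable())   # True: the spare absorbs one fault
pool.report_fault("memory-0")
print(pool.mission_capable())   # False: fewer than n units remain
```

The sketch treats "mission capable" as a yes/no property;
the graceful degradation mentioned in 2.6 is not modeled.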

3.3 NOVELTY: The system has two significantly unique
fault-tolerant design features as described below.

3.3.1 The first feature is the equipment status unit (SU)
which controls the total system hardware configuration.

The SU has the following characteristics and capabilities:

* There are two identical SUs, either of which may be
designated as the master unit.

* All communication paths between equipments (e.g.,
processor and memories) are controlled by the SU.

* By enabling (or disabling) the various communication
paths, the SU can split the system into two separate
computers (e.g., one can be used to exercise the other).
Individual equipment can also be isolated, if necessary, to
allow diagnostics to be performed.

* The error detection circuits in each equipment send
reports to the SU. These reports are used by software to
determine what equipment should have diagnostics performed,
as well as to make reconfiguration determinations.

* The SU enables special test paths in each equipment so
that diagnostics may be performed as described in 3.3.2.
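
The path-control role of the SU described above can be
sketched as follows. This is a minimal illustration, assuming
that the configuration can be modeled as a set of enabled
point-to-point paths; the class StatusUnit, its methods, and
the unit names P1, P2, M1, M2 are hypothetical and do not
describe the actual SU hardware or its interfaces.

```python
class StatusUnit:
    """Hypothetical model: the SU holds the set of enabled communication paths."""

    def __init__(self, units):
        self.units = set(units)
        self.enabled = set()          # unordered pairs of connected units

    def enable(self, a, b):
        self.enabled.add(frozenset((a, b)))

    def disable(self, a, b):
        self.enabled.discard(frozenset((a, b)))

    def isolate(self, unit):
        """Disable every path touching one unit, e.g. before running diagnostics."""
        self.enabled = {p for p in self.enabled if unit not in p}

    def partitions(self):
        """Return the groups of units that can still reach each other."""
        remaining, groups = set(self.units), []
        while remaining:
            group, frontier = set(), [remaining.pop()]
            while frontier:
                u = frontier.pop()
                group.add(u)
                for p in self.enabled:
                    if u in p:
                        frontier.extend(v for v in p - {u} if v not in group)
            remaining -= group
            groups.append(group)
        return groups


su = StatusUnit(["P1", "P2", "M1", "M2"])
for proc in ("P1", "P2"):
    for mem in ("M1", "M2"):
        su.enable(proc, mem)

# Split into two separate computers: P1 with M1, and P2 with M2.
su.disable("P1", "M2")
su.disable("P2", "M1")
print(len(su.partitions()))   # 2
```

With every path enabled the model forms a single computer;
disabling the two cross paths yields two independent halves,
one of which could then be used to exercise the other.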

3.3.2 The second unique fault-tolerant design feature
centers around the maintenance subsystem. The
characteristics and capabilities of this subsystem are:

* It is a special subsystem consisting of its own small
computer and associated equipment whose sole purpose is to
perform fault detection, isolation, and recovery. This is
done in conjunction with the status unit described in
Section 3.3.1.

* A special maintenance path is designed into all digital
equipment. The sole function of this path is to allow
diagnostics; this path, in essence, allows breaking large
sequential blocks of logic into smaller combinational
blocks which are easier to diagnose for faults.

* The maintenance system uses fault dictionaries to
isolate faults. The dictionaries are searched by using
error patterns which are collected from tests on the
hardware. The dictionaries are created by a table-driven,
logic-path-sensitized logic simulator, which predicts
possible error responses when the diagnostics are run
against the simulated hardware.

* The maintenance subsystem is redundant: each half
performs diagnostics on the other half.
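
The fault-dictionary idea above can be illustrated on a toy
circuit. The sketch below is not the SAFEGUARD simulator: it
replaces the table-driven, path-sensitized simulation with
brute-force simulation of single stuck-at faults on a
three-input combinational block, and every name in it
(circuit, TESTS, FAULTS, build_dictionary, isolate) is
hypothetical. It shows only the two steps named in the text:
predicting error patterns by simulation, and searching the
dictionary with an observed pattern.

```python
from itertools import product


def circuit(a, b, c, fault=None):
    """Toy combinational block: out = (a AND b) OR c.

    `fault` is an optional (signal, value) pair, e.g. ("n1", 0) for the
    internal node n1 stuck at 0.
    """
    def stuck(name, value):
        return fault[1] if fault and fault[0] == name else value

    a, b, c = stuck("a", a), stuck("b", b), stuck("c", c)
    n1 = stuck("n1", a & b)
    return stuck("out", n1 | c)


TESTS = list(product((0, 1), repeat=3))     # exhaustive test vectors
FAULTS = [(s, v) for s in ("a", "b", "c", "n1", "out") for v in (0, 1)]


def error_pattern(fault):
    """One flag per test: 1 where the faulty output differs from the good one."""
    return tuple(int(circuit(*t, fault=fault) != circuit(*t)) for t in TESTS)


def build_dictionary():
    """Simulate every fault once and index the faults by predicted pattern."""
    dictionary = {}
    for f in FAULTS:
        dictionary.setdefault(error_pattern(f), []).append(f)
    return dictionary


def isolate(dictionary, observed):
    """Look up an observed error pattern; an empty list means no match."""
    return dictionary.get(tuple(observed), [])


dictionary = build_dictionary()
observed = error_pattern(("c", 1))          # stands in for a hardware test run
print(isolate(dictionary, observed))
```

In this toy case the lookup returns three candidate faults,
because "c", "n1", and "out" stuck at 1 all force the output
to 1 and cannot be told apart from the output alone. A
dictionary lookup therefore narrows a fault to a set of
candidates rather than always to a single one.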

3.4 INFLUENCE: Main system concepts and architecture were
developed by numerous members of staff within Bell Labs.

3.5 HARD-CORE: There is no hard-core in the sense of
non-redundant hardware.


4.  SYSTEM  JUSTIFICATION 

4.1 RELIABILITY EVALUATION: System "reliability" is
expressed as A*R (i.e., "availability-reliability" product),
where A*R is defined as "the probability that the system

* will be available (i.e., error free) for full-load
operations when needed, and

* will continue to operate without system failure
throughout the engagement."

The total system A*R requirement is budgeted across the
various subsystems. Analysis of hardware reliability
models for each subsystem indicates that the design should
meet the A*R requirements. Diagnostic programs are analyzed
by the logic simulator (which also creates the fault
dictionaries) to determine the percent of faults detected.
This information is used in the A*R models. In addition,
error control software as well as diagnostic programs are
further tested by inducing randomly selected faults in the
hardware.
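
A rough numerical sketch may help show how an A*R figure
could be budgeted across subsystems. The survey response
does not give the actual models, so everything below is an
assumption: availability is taken as the usual steady-state
ratio MTBF/(MTBF + MTTR), reliability over the engagement as
exp(-t/MTBF) for a constant failure rate, the subsystems are
taken to be independent and all required, and the figures
are invented for illustration.

```python
import math


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability, assuming constant failure and repair rates."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def reliability(mtbf_hours, mission_hours):
    """Probability of no failure during the mission (exponential failures)."""
    return math.exp(-mission_hours / mtbf_hours)


def a_times_r(mtbf_hours, mttr_hours, mission_hours):
    """A*R for one subsystem: available when needed AND survives the mission."""
    return availability(mtbf_hours, mttr_hours) * reliability(mtbf_hours, mission_hours)


def system_a_times_r(subsystems, mission_hours):
    """Product over subsystems, assuming independence and that all are needed."""
    result = 1.0
    for mtbf, mttr in subsystems:
        result *= a_times_r(mtbf, mttr, mission_hours)
    return result


# Invented figures: (MTBF, MTTR) in hours for three hypothetical subsystems.
subsystems = [(2000.0, 1.0), (5000.0, 2.0), (10000.0, 0.5)]
print(round(system_a_times_r(subsystems, mission_hours=1.0), 5))
```

Under these assumptions the system figure is the product of
the subsystem figures, so each subsystem must be allotted an
A*R noticeably higher than the overall requirement.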

4.2 COMPLETENESS OF EVALUATION: Analysis of the system
design by the hardware A*R models has been completed,
except for possible updating based on diagnostic program
analysis results. About 20% of the diagnostic programs
have been analyzed by the logic simulator. About 15% of
the fault insertion analysis has been completed.

4.3 OVERHEAD: For the largest system, about 10% of the
hardware is required for the maintenance subsystem. Other
error detection hardware and equipment redundancy account
for possibly another 15% of the hardware. Diagnostic
programs and defensive programming techniques are estimated
to account for about 25% of the total source instructions.

4.4 APPLICABILITY: One or more of the fault tolerant
system features may be used with most digital systems.

4.5 EXTENDABILITY: It would probably be impractical to
extend any of the fault tolerant concepts in the present
system.

4.6 CRITICALITIES: System cost could be significantly
reduced by a relaxation in reliability. Multiprogramming
and multiprocessing are both required to handle the
required processing tasks in real time.

4.7 IMPLICATIONS: Because of the stringent availability
requirements, automatic fault detection, isolation and
recovery were of primary consideration.


5. CONCLUSIONS

5.1 STATUS: The prototype system is now being integrated.

5.2 EXPERIENCE: Experience to date indicates that the
system should meet its availability/reliability goals.

5.3 FUTURE: The system is scheduled for installation at
various government facilities in the near future.


SURVEY OF FAULT-TOLERANT COMPUTING SYSTEMS

Prof. Jerome H. Saltzer

Project MAC, MIT, Cambridge, MA., April 1973

1. IDENTIFICATION

1.1. NAME: Multics, for Multiplexed Information and
Computing Service.

1.2. RESPONSIBILITY: Massachusetts Institute of Technology,
Project MAC, Computer Systems Research Division. As of
1/73, a Honeywell product.

1.3. SUPPORT: Advanced Research Projects Agency via Office
of Naval Research.

1.4. PARTICIPANTS: MIT Project MAC; Honeywell Cambridge
Information Systems Laboratory (formerly the General
Electric Company Computer Department). Also Bell Telephone
Laboratories, 1965-69.

1.5. START: Planning in 1963-4, complete proposal in 1965.

1.6. COMPLETION: System first usable in 1968, available for
public use at M.I.T. in 1969, now commercially available.
Research continuing.

1.7. BIBLIOGRAPHY:

•The Multiplexed Information and Computing Service:
Programmers' Manual, M.I.T. Project MAC, Rev. 13, 1973.

•F.J. Corbato, et al., Session 6: A new remote accessed
man-machine system, AFIPS Conf. Proc. 27, (FJCC 1965),
Spartan Books, Washington D.C., pp. 185-247.

•F.J. Corbato, J.H. Saltzer, and C.T. Clingen, "Multics --
the first seven years," AFIPS Conf. Proc. 40, (SJCC 1972),
AFIPS Press, Montvale, N.J., pp. 571-583.

•E.I. Organick, The Multics System: an Examination of its
Structure, MIT Press, 1972.


2. MOTIVATION

2.1. PURPOSE: Multics is a prototype of the general-purpose
computer utility. It is intended to allow interactive
access to a shared information base, permit use of general
purpose programming, and be expandable and evolvable.
Reliability and fault tolerance were considered to be only
two of many overlapping and conflicting objectives.

2.2. PHYSICAL ENVIRONMENT: Multics is designed for use in a
ground-based data processing center.

2.3. COMPUTING ENVIRONMENT: Multics is approached by
interactive displays and typewriter terminals. For
large-volume data processing applications, card, printer,
and magnetic tape peripherals are provided, but all job
initiation is done interactively. Terminals are attached
directly, via the dial-up telephone network, and via the
ARPA network.

2.4. COMPUTING OBJECTIVES: Multics provides a wide range of
software services, languages, and tools for constructing
programs and subsystems. It provides interactive response
to small requests at the level of 2 seconds average, 5
seconds for 90% of requests. Larger compute-bound requests
are scheduled at a lower priority. With initial (1964)
hardware, configurations supporting from 10 to 120
simultaneous users can be constructed. Hardware installed
in fall 1972 increases the potential limit to about 400
users, and also improves response time. Software design
range is from 10 to 1000 users.

2.5. RELIABILITY OBJECTIVES: The primary reliability
objective concerns integrity of on-line file storage.
Ideally, the user can rely on the system to have a perfect
memory for his files. A secondary availability objective
is that the system operate continuously, on a 24-hour per
day basis. Recovery time following a failure is permitted
to have a wide variation, but an average on the order of a
few minutes. Objectives such as 100% continued operation
in the face of any single failure were not attempted.


2.6. DYNAMIC VARIABILITY: An individual installation may
choose the fraction of system resources to be used for file
backup operation, thereby providing varying degrees of
maximum setback for its users following the worst possible
kind of a system crash. If ?0% of resources are used for
backup, a maximum of 30 minutes of work can be lost by a
system crash. Smaller quantities of backup can produce
larger setbacks. The design is multiprocessor, to permit
restart with a smaller, lower-performance configuration,
without waiting for hardware to be repaired.

2.7. PENALTIES: Penalty depends on the range of
applications for which the system is being used. In the
M.I.T. environment, loss of stored files or lack of system
availability may mean disruption of administrative and
departmental operations which use the system. For
programming use, the penalty is small.

2.8. CONSTRAINTS: Multics is intended to be economically
competitive with other commercial and scientific data
processing systems. No unusual physical constraints exist.

3.  DESCRIPTION 

3.1.  ARCHITECTURE 

3.1.1.  CONFIGURATIONS 

3.1.1.1. INTERCONNECTIVITY: The hardware (Honeywell
600/6000) is a multiprocessor, multimemory design in which
each processor is connected by a separate cable to each
memory box. I/O controllers are attached to the memory
boxes in the same way as processors.

3.1.1.2. RANGE: Software permits 1-10 processors, 128K to
16M 36-bit words without change. Small changes would
permit essentially unlimited (e.g., 10E14 words) memory
sizes. Current hardware permits 1-7 processors, 128K to 2M
36-bit words. Small changes in hardware would permit up to
16M words.

3.1.1.3. CAPABILITY: Honeywell 645 CPU runs at 330,000
instructions/sec, about half the speed of a 360/65.
Honeywell 6180 CPU is expected to run about 1M
instructions/sec, somewhere between a 370/155 and a
370/165.

3.1.2. EXECUTIVE

3.1.2.1. MODES: System permits user-constructed cooperative
processes, utilizing multiprocessors and multiprogramming.
The multiple processors run independently and autonomously
rather than in a master/slave processor organization.

3.1.2.2. SOFTWARE: The system software appears to each user
as a private supervisor residing within his personal
address space. A small section of the supervisor is core
resident; the rest of the supervisor as well as all user
programs and data are paged. All programs, including the
core resident supervisor, are in the virtual memory.

3.2.  FAULT  TOLERANCE 

3.2.1. FAULTS TOLERATED: All forms of hardware and software
failures which are severe enough to cause a system crash
result in a service outage ranging from a few minutes to a
few hours, followed by availability of a reinitialized
system. All files are preserved, but computations in
progress must be restarted from the beginning or from the
last checkpoint which the user has provided. If the
operations staff has been well-organized in protecting tape
copies, it is possible to completely and automatically
recover even from a fire which destroys the computer system
(given enough time to install replacement hardware).

3.2.2. FAULTS NOT TOLERATED: Failures involving physical
destruction of on-line storage devices (e.g., disk head
crashes) are tolerated, but can result in outage of up to
several hours while reconstruction of the on-line files
from backup copies is performed.

3.2.3. TECHNIQUES:

•Backup copying: When an on-line file is created, a backup
copy is automatically made on a journal tape within half an
hour. Once each day, an extra set of journal tapes is
independently written, containing copies of all files
created since the previous day. Once each week, a logical
copy of every on-line file is made onto tape, to limit the
number of journal tapes which must be scanned to
reconstruct the on-line files. (The times of 1/2 hour, 1
day and 1 week are adjustable by the installation to local
needs.)


•Salvaging: Following a system crash for any reason, a
salvager program inspects the condition of all on-line
files and directories, and reports any uncorrectable
inconsistencies or irregularities in content and format. A
small amount of redundancy is used in directory structures,
to assist the salvager.

•On-line salvaging: Whenever an inconsistent directory
entry is discovered during normal operation, a version of
the salvager is immediately invoked to correct the
situation. Normally, service is not interrupted.

•Retrieval: If the salvager finds it impossible to
reconstruct one or a few files, but the number is small
enough that the expense of a complete file system
reconstruction from backup tapes is not warranted, the user
of the file is notified, and he may initiate retrieval of
his file from the backup or journal tapes. Retrieval of
older copies may also be requested by the user if he
accidentally damages or deletes the current on-line copy of
a file.

•Continuous Operation: The system is dynamically
reconfigurable, which means that processors and memory
boxes may be added or removed while the system is running a
production load. This technique permits both hardware and
software maintenance to be performed on detached units.
Since in addition the software system may be loaded onto
any available configuration of processors and memory boxes,
recovery following a solid hardware failure can be very
rapid.
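
The relationship between the weekly complete copy and the
journal tapes can be made concrete with a short sketch. It
is an illustration only and does not reproduce the Multics
reload programs: the function reload_files, the tape record
layout, and the use of integer timestamps are all
assumptions, and file deletions are not modeled. A reload
starts from the most recent complete copy and then scans
only the journal tapes written after it, keeping the newest
copy of each file.

```python
def reload_files(weekly_dump, journal_tapes):
    """Rebuild the on-line file set from backup tapes.

    weekly_dump:   dict mapping file name -> (timestamp, contents); the most
                   recent complete copy of every on-line file.
    journal_tapes: list of tapes written after the dump, oldest first; each
                   tape is a list of (name, timestamp, contents) records.
    """
    files = dict(weekly_dump)
    for tape in journal_tapes:
        for name, stamp, contents in tape:
            # Keep the newest copy seen so far for each file.
            if name not in files or stamp >= files[name][0]:
                files[name] = (stamp, contents)
    return {name: contents for name, (stamp, contents) in files.items()}


dump = {"a": (0, "a0"), "b": (0, "b0")}
tapes = [[("a", 10, "a1")], [("c", 20, "c0"), ("a", 30, "a2")]]
print(reload_files(dump, tapes))
```

Because the complete copy is taken weekly, at most about a
week of journal tapes has to be scanned in this model;
without it, every journal tape ever written would have to be
read.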

3.3. NOVELTY: The primary novelty of Multics in this area
is that the reliability objectives have been integrated
into a general-purpose computer programming system which
also meets a wide variety of other objectives. As far as
is known, Multics is the first general purpose system to
permit dynamic reconfiguration of processors and memory.

3.4. INFLUENCES: Experience in designing and using the
Compatible Time Sharing System for the IBM 7094 provided
the most obvious influence. The multiprocessor
organization was influenced by the Burroughs D825 computer
system.

4.  JUSTIFICATION 

4.1. RELIABILITY EVALUATION: In an operational environment
at M.I.T. for several years, the rate of loss of files
because of system failures has been low enough to be
acceptable to the user community, but has not been
evaluated. The average time down when a failure occurs is
about 10-15 minutes.

4.2. OVERHEAD: Hardware negligibly redundant. Variable
software overhead for backup. (See 2.6.)

4.7. IMPLICATIONS: For the file backup procedure to be
effective, it is essential that the computer operating
staff be highly organized, and that the operations
management thoroughly understand its responsibility in
helping safeguard user files stored on-line. (For example,
sloppy tape storage management cannot be tolerated.)


5.  CONCLUSIONS 

5.1. STATUS: The system has been operational at M.I.T. for
4 years and is the primary time-sharing system there. It is
also in use at 3 other sites, and on order at several others.

5.2. EXPERIENCE: The design seems to be adequate for the
quantity of storage currently being managed (100 million
words), but maximum reload times are proportional to this
quantity of on-line storage and are near the limit of
tolerance. A revised reload strategy employing parallel
processes is expected to provide an order of magnitude
increase in the practical storage quantity limit.

5.3. FUTURE: Research on many aspects of computer
operating systems other than reliability is continuing,
using Multics as a laboratory vehicle.

1. IDENTIFICATION

1.1. NAME: No. 1 ESS. A number of electronic switching
systems have been designed by Bell Laboratories during the
past several years. Those which have been described in the
literature include No. 1 ESS, No. 101 ESS, No. 2 ESS and
the Traffic Service Position Systems (TSPS). This response
will be concerned exclusively with No. 1 ESS, a large
telephone central office designed primarily for service
applications.

1.2. RESPONSIBILITY: The Indian Hill Switching Division,
Naperville Laboratory of Bell Laboratories.

1.3. SUPPORT: Development of the system supported by
Western Electric Company (WE), the manufacturing unit of
the Bell System.

1.4. PARTICIPANTS: The system was designed and developed
by Bell Laboratories, is manufactured and installed by the
Western Electric Company and is operated by the various
Bell System operating companies.

1.5.  START:  Active  work  on  the  design  of  No.  1  ESS  began 
in  late  1959. 

1.6. COMPLETION: The first system was put into service in
Succasunna in 1964. Both hardware and software
improvements have been made in the system from that time.

1.7. BIBLIOGRAPHY: The basic description of No. 1 ESS is
in the September, 1964 issue of the Bell System Technical
Journal. In addition the following bibliography deals
specifically with the problems covered in this survey.

• Downing, R. W., et al., "No. 1 ESS Maintenance Plan," Bell
System Technical Journal, Vol. 43, pp. 1961-2020,
September, 1964.

• Beuscher, H. J., et al., "Administration and Maintenance
Plan of No. 2 ESS," Bell System Technical Journal, Vol. 48,
pp. 2765-2815, October, 1969.

• Chang, H. Y. and Thomas, W., "Methods of Interpreting
Diagnostic Data for Locating Faults in Digital Machines,"
Bell System Technical Journal, Vol. 46, pp. 209-310,
February, 1967.

• Tsiang, S. H., Haugk, G. and Seckler, H. N., "Maintenance
of a Large Electronic Switching System," IEEE Transactions
on Communications Technology, pp. 1-9, February, 1969.

• Aitcheson, E. J. and Cook, R. F., "No. 1 ESS ADF
Maintenance Plan," Bell System Technical Journal, Vol. 49,
No. 10, pp. 2031-2056, December, 1970.

• Nowak, J. S. and Tuomenoksa, L. S., "Memory Mutilation in
Stored Program Controlled Telephone System," 1970 IEEE
International Conference on Communications, pp. 43-32 to
43-45.

• Chang, H. Y. and Scanlon, J. M., "Design Principles for
Processor Maintainability in Real-Time Systems,"
Proceedings of Fall Joint Computer Conferences, pp.
319-328, 1969.

• Nowak, J. S., "Emergency Action for No. 1 ESS," Bell
Laboratories Record, Vol. 49, No. 6, pp. 176-179,
June/July, 1971.

• Connet, J. R., Pasternak, E. J. and Wagner, R. D.,
"Software Defenses in Real-Time Control Systems," Second
Annual International Symposium on Fault Tolerant Computing,
June 19-21, 1972, Boston, Massachusetts.

• Almquist, R. T., et al., "Software Protection in No. 1
ESS," 1972 IEEE Conference on Communications, June, 1972.

• Ketchledge, R. W., "Service Experience with No. 1 ESS
Equipment," International Conference on Electronic
Switching, 1966 Proceedings, Paris, Edition Chiron, pp.

• Vaughan, H. E., "Experience with the No. 1 ESS,"
International Conference on Electronic Switching, 1966
Proceedings, Paris, Edition Chiron, pp. 704-711.

• Haugk, G., "Early No. 1 ESS Field Experience, Part 1,
2-Wire System for Commercial Applications," IEEE
Transactions on Communications Technology, Vol. 15, pp.
744-750, December, 1967.

• Seckler, H. N., "Early No. 1 ESS Field Experience, Part
2, 4-Wire System for Government and Military Applications,"
IEEE Transactions on Communications Technology, Vol. 15,
pp. 751-754, December, 1967.

• Johannesen, J. D., "No. 1 ESS Service Experience -
Software," IEEE Conference on Switching Techniques for
Telecommunication Networks, Conference Publication No. 52,
pp. 459-462, April, 1969.

• Staehler, R. E., "No. 1 ESS Service Experience -
Hardware," IEEE Conference on Switching Techniques for
Telecommunication Networks, Conference Publication No. 52,
pp. 463-466, April, 1969.


2. MOTIVATION

2.1. PURPOSE: Control the setting up and disconnection of
calls between telephone customers attached to the system or
between these telephone customers and other customers in
distant central offices.

2.2. ENVIRONMENT: The system must operate in the presently
existing telephone plant and must communicate with
telephone customers and other existing central offices.

2.3. COMPUTING ENVIRONMENT: The system does internal data
processing relating the signals transmitted by customers
and by other central offices to the desired telephone
connections. Its inputs are these signals as gathered by
peripheral equipment associated with the central processing
unit, and its outputs are control signals to a telephone
switching network and output signals which are transmitted
to distant central offices.

2.4. COMPUTING OBJECTIVES: The basic objective of the
system was to handle 100,000 peak busy hour calls. While
the original version of the system did not meet this goal,
software improvements have allowed this goal to be met
during the past year.

2.5. RELIABILITY OBJECTIVES: The reliability objective for
the system was a down time of no more than 2 hours in 40
years. When the down time objectives were originally set,
this down time was predicted to be due primarily to
simultaneous hardware failures of duplicated processor
units. As it turned out, software failures or human
failures leading to massive memory mutilation have been the
primary source of down time. In recent years, the down time
has been approaching the range of 10-15 hours per 40 years
and is still going down from this point.
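
To put the down-time figures above on a common scale, they can be converted to availability. The following is only an illustrative calculation and is not part of the original response; the function name and the assumed 365.25-day year are choices made here for the sketch.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average calendar year


def availability(down_hours: float, years: float) -> float:
    """Fraction of time in service, given total down time over a period."""
    total_hours = years * HOURS_PER_YEAR
    return 1.0 - down_hours / total_hours


objective = availability(2, 40)       # stated objective: 2 hours in 40 years
reported_low = availability(10, 40)   # lower end of the reported range
reported_high = availability(15, 40)  # upper end of the reported range

for label, a in [
    ("objective (2 h / 40 yr)", objective),
    ("reported (10 h / 40 yr)", reported_low),
    ("reported (15 h / 40 yr)", reported_high),
]:
    print(f"{label}: availability {a:.7f}, unavailability {1 - a:.2e}")
```

Under these assumptions the objective corresponds to an unavailability of a few parts per million, and the reported range is five to seven and a half times that.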

2.6. DYNAMIC VARIABILITY: Reliability in 2.5 above has
been defined in terms of total system reliability. Dynamic
variability can be thought of in terms of the ability to
handle telephone traffic in the presence of overload
exceeding the capability of the system. A dynamic overload
response has been built into the system which allows
additional service requests to be throttled during periods
of excessive demand.
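
The idea of throttling new service requests under overload can be sketched as a simple admission control. This is a generic illustration, not a description of the actual No. 1 ESS overload mechanism; the class name, the per-interval capacity, and the bounded waiting queue are all assumptions made for the example.

```python
from collections import deque


class OverloadThrottle:
    """Admit at most `capacity` new service requests per interval.

    Hypothetical model: excess requests wait in a bounded queue, and
    requests arriving when the queue is full are turned away.
    """

    def __init__(self, capacity: int, max_waiting: int):
        self.capacity = capacity
        self.max_waiting = max_waiting
        self.waiting = deque()

    def interval(self, arrivals):
        """Process one interval; return (admitted, rejected) lists."""
        rejected = []
        for request in arrivals:
            if len(self.waiting) < self.max_waiting:
                self.waiting.append(request)
            else:
                rejected.append(request)  # shed load instead of overrunning
        n = min(self.capacity, len(self.waiting))
        admitted = [self.waiting.popleft() for _ in range(n)]
        return admitted, rejected


throttle = OverloadThrottle(capacity=2, max_waiting=3)
first = throttle.interval([1, 2, 3, 4, 5])
second = throttle.interval([])
```

The point of the sketch is only that work already accepted keeps being served at the rated capacity while new demand is delayed or refused.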

2.7. PENALTIES: Penalties for total system failure may
include the inability to make a telephone call at a
critical time, with resultant possible loss of life and/or
property. For example, the inability to call the fire
department can be quite serious. However, the penalty is
dependent on the time of the occurrence of the failure. In
many cases no penalty will result.

2.8. OFFICE CONSTRAINTS: The equipment must be installed
in a telephone central office. It is desirable that it
operate with normal Bell System nominal 48-volt battery as
the primary power source. Minimum space is desirable but
not critical since the cost of space is comparable to the
normal cost of office and factory space. Air conditioning
is normally provided but the system must be able to work
for moderate periods of time without air conditioning. The
cooling system consists of normal convection cooling
augmented by conventional air conditioning.


2.9. TRADEOFFS: System capability and system storage costs
are among the main tradeoffs available in the system. The
use of a read-only program memory means that all program
storage must be paid for on a permanent basis. The range
of office sizes encountered in the Bell System means that a
change in system capacity will affect the market for a No.
1 ESS. Price was a very important factor since a No. 1 ESS
provides the same basic type of telephone service available
from older, efficient, and relatively inexpensive telephone
systems. Price differential must be justified in terms of
greater flexibility for future changes and long term lower
costs due to automated manufacturing techniques.

3.  DESCRIPTION  OF  THE  SYSTEM 

3.1.  ARCHITECTURE 

3.1.1.  CONFIGURATIONS 

3.1.1.1. INTERCONNECTIVITY: The basic block diagram of the
system is presented in the BSTJ reference (first article).
Basically each central control has three bus systems: a
peripheral bus system, including an addressing system, a
unit selection system, and a response bus; a read-write
store (call store) bus system with addressing, data-write,
and data-read sections; and a read-only memory (program
store) bus system including addressing and response
information. Each of the central controls has full access
to all busses. The two central controls are interconnected
by match busses to allow information in the two controls to
be matched. In the normal mode only one central control
has control access to the peripheral bus system although
both central controls listen to the response bus. Each of
the central controls in the normal mode controls one set of
a duplicate set of stores. However, it is possible for one
central control to control all stores and for the central
controls to alternate in controlling the peripheral bus
system. All critical equipment, which includes stores,
central controls, busses and peripheral control units, is
duplicated.

For larger systems, a signal processor is placed on the
call store bus. This signal processor has access to its
own read-write memories and also has access to the
peripheral bus system. The signal processor then is used to
control input/output equipment such as signaling equipment
and the switching network.

3.1.1.2. RANGE: Only one basic central processor is used
in any system, defining a central processor as a duplicated
central control, duplicated signal processors when
required, and duplicated stores. The duplication of the
memory modules is such that each module is effectively
divided into two parts; therefore, an odd number of modules
can exist in the system. The limit on the number of
read-only stores including duplication is 12, each of which
contains 131,000 44-bit words (the 44 bits are 37 bits of
information and 7 bits of Hamming code); the limit on the
number of call store modules is about ten, each containing
32,000 words, 24 bits per word. The original system
contained 8,000 word modules, but this year we have started
using the larger sizes.
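
The 37 + 7 split quoted above is consistent with the standard Hamming bound. The response does not say how the seven check bits are organized, so the following is only a sketch under the usual textbook assumption: single-error correction needs the smallest r with 2**r >= m + r + 1, and one further overall parity bit adds double-error detection. The function names are hypothetical.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r such that 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def sec_ded_check_bits(data_bits: int) -> int:
    """Check bits for single-error correction plus double-error detection."""
    return sec_check_bits(data_bits) + 1  # one extra overall parity bit


data = 37
word = data + sec_ded_check_bits(data)
print(sec_check_bits(data), sec_ded_check_bits(data), word)
```

With 37 information bits this gives 6 check bits for correction alone and 7 with the extra parity bit, for a 44-bit word.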

3.1.1.3. CAPABILITY: The best way of indicating the
capacity of processors is in terms of the number of calls
which can be handled; as indicated earlier this figure now
exceeds 100,000 during the peak busy hour. The basic cycle
time of the system is 5.5 microseconds, during which a
complete addition can be performed. Program and data can
be read in parallel. The order structure of the system is
sufficiently powerful that the 5.5 microsecond time gives a
misleadingly low indication of the basic power of the
processor. In general terms, it might be compared in power
to an IBM 7094 computer.

3.1.2. EXECUTIVE AND OPERATING SYSTEMS

3.1.2.1. MODES OF OPERATION: The signal processor operates
independently of the central control. The central control
handles all telephone calls in the office on a time shared
basis, working on one call at a time but only doing part of
the work necessary to process that call. Work is time
sliced so that, in general, no single task should exceed
about 20 milliseconds of processor time. In an office
without a signal processor, I/O is carried out by an
interrupt level program which takes command of the system
every [illegible] milliseconds.




3.1.2.2. SOFTWARE ORGANIZATION: Tasks are [...] executive
program which [...] are dispensed by an [...] hoppers to
[...] individual task request [...] that all task hoppers
are [...] check is made to insure [...] interaction [...]
hardware [...] interrupt which is used for I/O operations
and trouble detection and the emergency action circuit
which [...] of the system to cycle satisfactorily [...]
causes a switch to standby equipment.

3.2.  FAULT  TOLERANCE 

[...] tolerates faults in [...] by switching to the
appropriate standby equipment [...] were used in [...]
which [...]
3.2.3. TECHNIQUES

* Duplication. All critical equipment is duplicated.

SURVEY OF FAULT-TOLERANT COMPUTING SYSTEMS

Prof. Daniel Siewiorek, Computer Science Dept., Carnegie-
Mellon University, Pittsburgh, Pa. 15213, April 1973

1.  IDENTIFICATION 

1.1. NAME: C.mmp (multi-mini-processor)

1.2. RESPONSIBILITY: Computer Science Department,
Carnegie-Mellon University

1.3. SUPPORT: ARPA

1.4. PARTICIPANTS: C.G. Bell, B. Broadley, E. Cohen, A.
Jones, R. Levin, J. McCredie, A. Newell, C. Pierson, F.
Pollack, R. Reddy, W. Wulf, and many others.

1.5. START: Mid-1971

1.6. COMPLETION: Mid-1973 (hardware, initial software)

1.7. BIBLIOGRAPHY:

• Bell, C.G., W. Broadley, W. Wulf, A. Newell, C. Pierson,
R. Reddy, and S. Rege, "C.mmp: The CMU Multi-mini-processor
Computer - Requirements and Overview of the Initial
Design," August, 1971. Carnegie-Mellon University,
Computer Science Department Research Report. (AD 739963)

• W. Wulf, "C.mmp: A Multiminiprocessor," Computer Science
Research Review, Carnegie-Mellon University, 1971-1972.

• S. Fuller, R. Swan, W. Wulf, "The Instrumentation of
C.mmp, A Multiminiprocessor," COMPCON 73, pp. 173-176, 1973.

2.  MOTIVATION 

2.1. PURPOSE: General-purpose and real-time computing

2.2. PHYSICAL ENVIRONMENT: Ground-based

2.3. COMPUTING ENVIRONMENT: Initially stand-alone;
eventually on the ARPANET.

2.4. COMPUTING OBJECTIVES: C.mmp was designed to provide
a real-time processing and time sharing environment, e.g.,
for research in speech and vision. Thus special high data
rate, real-time interfaces are required to acquire speech
and vision data from the external environment. Also,
real-time processing for the speech-understanding system is
an ultimate goal. Execution of up to 3 to 15 million
instructions/sec achieved through 1-16 memory modules (650
nsec cycle time) with up to 256K words each, 1-16
processors (PDP-11), 16x16 crossbar switch with 80x10E6
words/second capacity.

2.5. RELIABILITY OBJECTIVES: Since the system is ground
based and maintenance is available, the major reliability
objective is high availability. With the ability to
dynamically reconfigure the system, the ultimate goal is
continuous availability.

2.6. DYNAMIC VARIABILITY: Reliability can be traded for
performance by 1) parallel and independent computations on
different processors and/or by 2) graceful degradation,
possibly even on a millisecond scale.

2.7. PENALTIES: Mutilation of data in critical system
tables could cause a system crash. Loss of experimental
data or active programs would result.

2.8. CONSTRAINTS: The major constraint was cost. The
objective was to build a high-performance system using
off-the-shelf components which could out-perform
conventional systems for a fraction of the cost. The
presence of multiple copies of various components in the
system also provides opportunities for a fault-tolerant,
highly available system.

2.9. TRADEOFFS: Hardware efficiency (cost per unit work)
can be traded for performance and/or reliability.

3.  DESCRIPTION  OF  THE  SYSTEM 

3.1.  ARCHITECTURE 

3.1.1. CONFIGURATION

3.1.1.1. INTERCONNECTIVITY: The configuration is basically
a conventional multiprocessor system, but on a much larger
scale than in existing systems. The structure of the
system is given in Figure 1.

There are two switches, Smp and Skp. Smp allows the
processor to communicate with primary memories. Skp allows
the processor to communicate with the various controllers
(K), which in turn manage the secondary memories (Ms), and
I/O devices (T). These switches are under both computer
and manual control.


Each processor system is actually a complete computer with
its own local primary memory and controllers for secondary
memories and devices. Each processor has a Data Operations
component, Dmap, for translating addresses at the processor
into physical memory addresses. The local memory serves
both to reduce the bandwidth requirements to the central
memory and to allow completely independent operation and
off-line maintenance. Below we describe some of the
specific components shown in Figure 1.

* K.clock: A central clock, K.clock, allows precise time
to be measured. A central time base is broadcast to all
processors for local interval timing and interruption.

* Smp: This switch handles information transfers between
primary memory, processors and I/O devices. The switch has
ports (i.e., connections) for m busses for primary memories
and p busses for processors. Up to min(m,p) simultaneous
conversations are possible via the cross-point arrangement.
Smp can be set under programmed control or via manual
switches on an override basis to provide different
configurations. The control of Smp can in principle be by
any of the processors; one processor is assigned the
control at any one time by manual reconfiguration.

* Mp: The shared primary memory, Mp, consists of (up to)
16 modules of (up to) 65K 16-bit words. The initial
memories being used have the following relevant parameters:
(1) they are core, (2) each module is 8-way interleaved,
(3) access time is 250 ns and cycle time is 650 ns.

* Skp: Skp allows one or more of k Unibusses (the common
bus for memory and I/O on an isolated PDP-11 system) which
have several slow or fast controllers (Ks or Kf) to be
connected to one of p central processors. The k Unibusses
with the controllers are connected to the p processor
Unibusses on a fairly long term basis. The main reason
for only allowing a long term, but switchable, connection
between the k Unibusses and the processor is to avoid the
problem of having to decide dynamically which of the p
processors should manage a particular device. Like Smp,
Skp may be controlled either programmatically or manually.

* Pc: The processing elements, Pc, are slightly modified
versions of the DEC PDP-11. (The several models of the
PDP-11 may be intermixed.)

* Dmap: The Dmap is a Data Operations component which
takes the addresses generated in the processor and converts
them to addresses to use on the Memory and Unibusses
emanating from the Dmap. There are four sets of eight
registers in Dmap, enabling each of eight 8,192-byte blocks
to be relocated in the large physical memory. The size of
the physical Mp is 2E20 words (2E21 bytes). Two bits in
the processor together with the address type are used to
specify which of the four sets of mapping registers is to
be used.
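
The relocation performed by Dmap can be sketched as follows. The text gives only the block size, the number of register sets, and the size of physical memory; the register encoding (a physical block number per register) and the use of the top three bits of a 16-bit processor address to pick a register are assumptions made for this illustration, and the class and method names are hypothetical.

```python
BLOCK_BYTES = 8192                            # block size from the text (2**13)
PHYS_BYTES = 2 ** 21                          # physical Mp size in bytes
NUM_SETS = 4                                  # four sets of mapping registers
REGS_PER_SET = 8                              # eight registers per set
NUM_PHYS_BLOCKS = PHYS_BYTES // BLOCK_BYTES   # 256 relocatable blocks


class Dmap:
    """Hypothetical model of the address relocation step only."""

    def __init__(self):
        # registers[s][i] holds a physical block number (assumed encoding).
        self.registers = [[0] * REGS_PER_SET for _ in range(NUM_SETS)]

    def load(self, reg_set: int, index: int, phys_block: int) -> None:
        if not 0 <= phys_block < NUM_PHYS_BLOCKS:
            raise ValueError("physical block number out of range")
        self.registers[reg_set][index] = phys_block

    def translate(self, reg_set: int, addr16: int) -> int:
        """Map a 16-bit processor byte address to a physical byte address."""
        if not 0 <= addr16 < 2 ** 16:
            raise ValueError("processor address must fit in 16 bits")
        index = addr16 >> 13                  # top 3 bits select the register
        offset = addr16 & (BLOCK_BYTES - 1)   # low 13 bits pass through
        return self.registers[reg_set][index] * BLOCK_BYTES + offset


dmap = Dmap()
dmap.load(reg_set=1, index=3, phys_block=200)
physical = dmap.translate(1, (3 << 13) | 5)
```

Eight blocks of 8,192 bytes cover exactly the 64K-byte address space of a 16-bit processor address, which is why eight registers per set suffice in this model.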

3.1.1.2. RANGE: 1-16 memory modules with up to 256K words
each (core, 650 nsec cycle time), 1-16 PDP-11 processors (16
bits/word), 16x16 crossbar switch with 80x10E6 words/second
capacity. A two-processor, two-memory prototype has been
built to test out concepts of switch and software design.
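
The limit of min(m,p) simultaneous conversations through a crossbar follows from each memory port serving one processor at a time. A minimal sketch of one arbitration cycle is shown below; the fixed-priority policy (lowest-numbered processor wins) and the function name are assumptions for illustration, since the source does not describe how conflicts are resolved.

```python
from typing import Dict


def grant(requests: Dict[int, int]) -> Dict[int, int]:
    """Resolve one cycle of crossbar requests.

    `requests` maps processor number -> requested memory module.
    Returns memory module -> processor granted access this cycle.
    Assumed policy: the lowest-numbered requesting processor wins.
    """
    granted: Dict[int, int] = {}
    for processor in sorted(requests):
        module = requests[processor]
        if module not in granted:
            granted[module] = processor
    return granted


# Processors 0 and 1 contend for module 2; processor 2 uses module 5.
result = grant({0: 2, 1: 2, 2: 5})
```

The number of grants can never exceed either the number of requesting processors or the number of distinct modules, which is the min(m,p) bound.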

3.1.1.3. CAPABILITY: The system should be capable of
executing 3 to 15x10E6 instructions per second, depending
on the PDP-11 processor model. A PDP-10 can execute
roughly 3 to 15x10E5 36-bit instructions per second.

3.1.2. EXECUTIVE AND OPERATING SYSTEM
3.1.2.1. MODES, and 3.1.2.2. SOFTWARE: Although the
technology of operating systems has made significant
progress in the past decade, there are few systems
constructed specifically for multiprocessor environments.

In particular, no systems have been built to support the
variety of process relations (parallel, pipeline, etc.)
envisioned for C.mmp. Moreover, there is a relative lack
of experience in organizing computations for parallel
execution. These facts have driven the operating system
design to the following conservative position.

The operating system will consist of a "kernel" and a
"standard extension". The kernel will provide a set of
mechanisms (tools) for building an operating system, but no
policies (e.g., no scheduler, no file structure, no...).

The kernel will support the (simultaneous) execution of an
(almost) arbitrary number of extensions.


In considering what set of mechanisms (tools) should be
provided by an operating system kernel, two commonly held
views of the essential nature of an operating system are
relevant:

* An operating system creates a "virtual machine" to
support (user) programs by providing resources and
operations not present in the underlying hardware (e.g.,
"files", file "read" and "write" operations, etc.).

* An operating system is a resource (virtual and physical)
manager and allocator.

Note the emphasis in both views on resources, their
creation, management, and operations on them. From these
views we infer that an appropriate set of tools for
building an operating system must provide for:

* The creation of new virtual resources;

* The "representation" of a new resource in terms of
existing ones;

* The creation of operations on resources and/or their
representation;

* Protection (against illegal operations on a resource),
uniformly over a class of resources, as well as with regard
to specific instances of a resource.

3.2. FAULT TOLERANCE

3.2.1. FAULTS TOLERATED: The ultimate goal is to be able
to tolerate any fault in any unit. The system can be
dynamically reconfigured via the crosspoint switch
(disabling specific crosspoints) and via power switching.
The detection of and recovery from failures will be a
major objective. As a research vehicle, C.mmp will allow
the study of fault-tolerant hardware-software interaction.

3.2.2. FAULTS NOT TOLERATED: Faults (perhaps multiple)
that go undetected long enough to mutilate the majority of
the copies of critical systems tables may ultimately lead
to an entire system crash. Early detection and/or
prevention of this class of faults will be closely studied.
Multiple failures in the crossbar switch might also lead to
system failure.

3.2.3. TECHNIQUES: The final hardware/software
configuration for C.mmp is far from stabilized. However
the following techniques either are incorporated, or
provisions for incorporation have been made, or (for
incremental cost) can be incorporated.

The crossbar switch is bit sliced with provision for a
Hamming code on the data bits. Spare bit-plane switching
or fault-masking redundancy can be employed. Switch
failures appear as either a memory or processor failure.
These failures can be tolerated.

Busses can function properly when a component connected to
it has power removed. Memory modules are organized as
banks so that a memory failure simply removes a part of the
memory space.

Memory and address parity. Table-driven operating systems
can be written which can allow graceful degradation from
failure in a memory or in a processor module (removing a
resource from availability). Software recalculation.
Multiple copies of those critical systems tables will
assist failure tolerance or recovery.

Alternatively, critical computations might be performed by
two distinct methods within a single processor. Diagnostic
programs can be run just before critical computations are
to be performed, at fixed intervals, or simply whenever the
processor is not occupied with other tasks.
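
Performing a critical computation by two distinct methods within a single processor and comparing the results can be sketched as below. The wrapper name and the choice of summation as the "critical computation" are illustrative assumptions; nothing here is taken from the C.mmp software itself.

```python
def checked(method_a, method_b, *args):
    """Run two independent methods and accept the result only if they agree."""
    a = method_a(*args)
    b = method_b(*args)
    if a != b:
        raise RuntimeError("results disagree; a fault is suspected")
    return a


def sum_forward(values):
    """Accumulate left to right."""
    total = 0
    for v in values:
        total += v
    return total


def sum_pairwise(values):
    """Accumulate by recursive halving, exercising a different code path."""
    if len(values) <= 1:
        return values[0] if values else 0
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])


total = checked(sum_forward, sum_pairwise, [1, 2, 3, 4, 5])
```

A disagreement only signals that something went wrong; deciding which result to trust needs further information, such as a diagnostic run or a third computation.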

3.3. NOVELTY: The distributed nature of operating systems
allows for fault tolerance without massive expenditures for
special hardware. Software can be devised to function
without faulty units. Critical calculations can easily be
recalculated for checking purposes. A faulty unit is
easily isolated via the crossbar switch.

An extension to the decoding process for a single error
correcting/double error detecting Hamming code to enable
double error correction has been investigated.

3.4. INFLUENCES: Many previous efforts have had influence
on the design but there is no single major influence.

3.5. There is only one portion of the system
which is not replicated: the crossbar switch. The switch
has been designed so that failures appear either as a
memory or a processor failure. Bit slicing, Hamming codes,
and fault-masking redundancy can help to increase the
switch reliability.


4.  JUSTIFICATION 

4.1. RELIABILITY EVALUATION: Reliability will be
evaluated via analysis.

4.2. COMPLETENESS: Evaluation not yet finished.

4.3. OVERHEAD: To date the hardware for fault tolerance
is certainly less than 5%. However the design will evolve
and quite probably raise this percentage. Software cost
(in execution time) is difficult to estimate at this time.

4.4. APPLICABILITY: Any multiprocessor crossbar configuration.

4.5. EXTENDABILITY: System cannot be expanded beyond 16
memories and 16 processors without a new crossbar switch.

4.6. CRITICALITIES: Analysis shows that the selection of
the memory cycle time and number of processors greatly
affects system performance and cost-effectiveness.
Contention in the crossbar switch limits ultimate
performance.

4.7. IMPLICATIONS: Programmers must ensure that their
system is correct, even under conditions of asynchronous
process communication.


5.  CONCLUSIONS 

5.1. STATUS: First portion of the hardware system should
be completed by the end of summer 1973. Portions of Hydra
(the operating system) are operable.

5.2. EXPERIENCE: None to report yet.

5.3. FUTURE: In the immediate future C.mmp will be
brought up as a research tool for the Computer Science
Department. As a research tool it will most likely
continue to evolve in design.

5.4. ADVANCES: Off-the-shelf, plug-compatible
fault-tolerant (or at least self-checking) components would
be very desirable. As hardware becomes cheaper, the capacity
of modules becomes larger. And with LSI the insides of a
module are not even accessible. Hence building
fault-tolerant systems with off-the-shelf components without
self-checking or fault tolerant features is very inefficient.
(Other than duplication and comparison, or triplication and
voting, little else is available to the system designer.)
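
The two fallback techniques named above, duplication with comparison and triplication with voting, differ in what they can conclude from a disagreement. The sketch below illustrates that difference in software terms; the function names are hypothetical and the module outputs are modeled as plain values.

```python
from collections import Counter


def compare_duplicates(a, b) -> bool:
    """Duplication and comparison: detects a disagreement but cannot
    tell which of the two copies is wrong."""
    return a == b


def vote(a, b, c):
    """Triplication and voting: return the majority value, masking one
    faulty copy; raise if all three outputs differ."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among the three copies")
    return value


agree = compare_duplicates(7, 7)
masked = vote(7, 9, 7)   # the single deviating copy is outvoted
```

Duplication costs one extra copy and yields detection only; triplication costs two extra copies and yields masking of a single fault, which is the tradeoff the paragraph alludes to.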

System validation (integrated hardware and software) is
another important area. Also desirable would be a
methodology for designing a fault tolerant system. Which
fault tolerant techniques complement each other? Finally,
switches for reconfiguration, switch control, and fault
tolerant switch design are areas requiring further study.

SURVEY OF FAULT TOLERANT COMPUTER SYSTEMS
Donald C. Wallace

Stanford Research Institute, Menlo Park Ca, June 72

1.  IDENTIFICATION 

1.1 NAME: COMEX - Online order handling system

1.2 RESPONSIBILITY: P.C. Service Corp. (Subsidiary, Pacific
Coast Stock Exchange)

1.3 SUPPORT: Member firms of PCSE

1.4 PARTICIPANTS: Member firms of PCSE

1.5  START:  Contract  let  -  17  November  1967 

1.6  COMPLETION:  System  accepted  -  4  December  1969 

1.7 BIBLIOGRAPHY: The most accurate description of the
COMEX is the final documentation delivered with the system.
Documents:

Specification for data processing and communication
equipment for Pacific Coast Stock Exchange, PC Service
Corp., 1967

Proposal for Real-Time Order Handling System, BBN
#p6P- bE-01, 4 August 1967

Contract for Real-Time Order Handling System for Pacific
Coast Stock Exchange, BBN/PCSE, 17 November 1967

2.  MOTIVATION 

2.1  PURPOSE:  Real  time  odd-lot  order  execution 

2.2 PHYSICAL ENVIRONMENT: Ground based

2.3  COMPUTING  ENVIRONMENT:  The  system  serves  two  trading 
floors,  one  In  Los  Angeles,  the  other  in  San  Francisco. 

2.4 COMPUTING OBJECTIVES: COMEX is designed to handle
virtually all low-speed teletype speeds, levels and codes.
It appears as a node on each of the connected broker firms'
communication networks and must conform to the line
protocols and hardware constraints of that network. The
design objectives were for 64 "nodes" in L.A. and 64 in
S.F., and for a maximum message-switching traffic of 25,000
orders/transactions per day.

2.5 RELIABILITY OBJECTIVES: The system was designed to
provide 99% uptime, with a "no message lost" criterion.

2.6 DYNAMIC VARIABILITY: The system is designed so that
order entry is performed in real time, but the order
execution process may lag an arbitrary period of time. In
operation this lag never exceeds 20 minutes (approx.??).

2.7 PENALTIES: COMEX has various degrees of degradation,
the ultimate being total manual operation and execution of
the orders by the specialists on the trading floors.
Esoteric software/hardware malfunctions could cause
extremely large manual intervention problems, as the system
is really buying and selling stock on the behalf of
members of the exchange.

2.8 CONSTRAINTS: The PCSE is really two exchanges with two
different trading floors, one in Los Angeles and one in
San Francisco. For reliability reasons the system is
fully redundant. A PCSE constraint on the system was that
the system be equally split between the two sites.


3. DESCRIPTION

3.1 ARCHITECTURE

3.1.1 CONFIGURATION

3.1.1.1 INTERCONNECTIVITY: See diagram, which shows the
twin IBM 360 computers and the 680 systems, each of which
includes a DEC PDP-8 computer.

3.1.1.2 RANGE: The system is really two systems running in
parallel. It is sensible to run them as single units or as a
fully redundant system. Two configurations are possible:

Non-partitioned trading floors:

LA-remote680, SF-local680 and SF-360
SF-remote680, LA-local680 and LA-360

Partitioned trading floors:

SF-local680 and SF-360
LA-local680 and LA-360

3.1.1.3 CAPABILITY: COMEX consists of two (2) 360/50
computers plus the front-end communications systems.


3.1.2 EXECUTIVE and operating system: COMEX runs under
IBM/360 DOS with its fixed number of multiprogram
partitions option.

3.1.2.1 MODES of operation: The order execution process
runs in a high-priority partition of DOS while normal
PC Service Corp. computer operations are
being run in other "foreground" and the background
partitions. The communication process (in the 680's) is
dedicated and allows no other functions.

3.1.2.2 SOFTWARE organization: Basically the 680's do
character assembly (bits), line protocol interpretation
(answer back, echo, etc.), message segment assembly, I/O
buffering, and transmission to local and remote 360's. The
360's do message switching, code translation, message
decoding (syntax analysis), order queuing, decoding of
NYSE and AMEX tickers (to identify trades), execution of
queued orders, and sending of confirmations to broker and
specialist.

3.2  FAULT  TOLERANCE 

3.2.1 FAULTS TOLERATED: Essentially the system will
tolerate any or all failures in a single system (i.e.,
backup or primary).

3.2.2 FAULTS NOT TOLERATED: Any simultaneous failures in
both the primary and backup systems cause loss of
integrity of the data files. This is considered a
catastrophic event, and some manual correction and
intervention for order execution and notification will be
needed. (To my knowledge this has only occurred once in
almost three years of operation.)

3.2.3  TECHNIQUES: 

HARDWARE: The COMEX system is completely redundant
(two of everything), and both systems run in parallel.
The major design criterion was that nothing should happen
in one system half that could adversely affect the other.
This led to the system interconnections (PCU) being
unidirectional and step-locked in a "here's a word, take a
word" fashion. All TTY connections to the system are dual
dropped and there is a hardware interlock to prevent both
680 machines from outputting to a line at the same time.

SOFTWARE: The software is designed to be very
modular, and no control flow exists between functional
routines. Control flow is between the COMEX scheduler/
executive and each functional module. Data is passed from
function to function by means of stacks and lists, and
standard system global routines are used to accomplish
this. Both systems are actually performing the entire
order execution task in parallel and there is really no
communication between them. The only difference is that
the "backup" system is not outputting transaction
confirmations and order receipt notifications. The backup
system maintains a queue of the last "n" messages to each
line in the system. When switch-over occurs, these
messages are output to the specialists/brokers with a "may
be duplicate" tag.
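
The switch-over behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: the class name, the tag text, the line identifiers and the value of "n" are hypothetical and are not taken from the COMEX documentation.

```python
from collections import defaultdict, deque

N_RETAINED = 3  # hypothetical value of "n"; the report does not state it


class BackupOutputLog:
    """Holds the last n messages per line while the backup system stays silent."""

    def __init__(self, n=N_RETAINED):
        # One bounded queue per line; the oldest entry is dropped automatically.
        self._recent = defaultdict(lambda: deque(maxlen=n))

    def record(self, line_id, message):
        """Remember a message the backup computed but did not transmit."""
        self._recent[line_id].append(message)

    def switch_over(self):
        """Return (line_id, text) pairs to transmit when the backup takes over."""
        out = []
        for line_id, messages in self._recent.items():
            for message in messages:
                out.append((line_id, "MAY BE DUPLICATE: " + message))
        return out


log = BackupOutputLog(n=2)
log.record("broker-7", "order 101 received")
log.record("broker-7", "order 101 executed")
log.record("broker-7", "order 102 received")
resent = log.switch_over()  # only the last two messages for the line remain
```

Tagging rather than suppressing the retransmitted messages fits the "no message lost" objective: a recipient may see a message twice, but should not miss one.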

3.3 NOVELTY: The interconnection of the DEC 680's and the
S/360's is accomplished without requiring modifications or
additions to the IBM operating system or providing
"special" I/O modules. The 680's (two of them) have a
S/360 channel equivalent (FCU) that talks to the IBM 2841
disk controller with the two-channel feature (8100). This
is the equivalent of having two 360 systems talking to one
disk system. This is a standard IBM configuration
possibility (though not supported by IBM software). If
the user is willing to accept implementing his own
read/write lock mechanisms there is nothing in the IBM
system to preclude this mode of operation. Given all of
the above it is now possible to write a communications
system strictly at the user level using standard IBM I/O
software. Data just "appears" on the disk and is read
into the 360 and is in turn written on the disk and just
"disappears". The data from the 680's is written as a
sequentially ever-growing file, capturing an entire day's
transactions. This allows "rerunning" a day's
transactions in real time to find obscure bugs.
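
The idea of capturing a day's input as an ever-growing file and rerunning it later can be sketched as below. This is a minimal illustration, not the COMEX implementation: the record format and function names are hypothetical, an in-memory buffer stands in for the shared disk file, and the sketch replays records in order without reproducing their original timing.

```python
import io


def append_transaction(log_file, record):
    """Append one record to the end of the day's file (one record per line)."""
    log_file.write(record + "\n")


def replay(log_file, handler):
    """Feed every captured record, in original order, back through handler."""
    log_file.seek(0)
    count = 0
    for line in log_file:
        handler(line.rstrip("\n"))
        count += 1
    return count


day_file = io.StringIO()  # stands in for the shared disk file
for rec in ["BUY 50 XYZ", "SELL 20 ABC", "BUY 10 XYZ"]:
    append_transaction(day_file, rec)

seen = []
replayed = replay(day_file, seen.append)  # rerun the day through a handler
```

Because the file is only ever appended to, a rerun sees exactly the input sequence the live system saw, which is what makes an obscure bug reproducible.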

3.4 INFLUENCES: After spending several years working on
modified or bastard 360 systems and realizing the effort
level to maintain these systems given the frequency of new
IBM releases, it seemed insane to design a system that
relied on anything except the most rudimentary features
of the IBM monitor. The approach described has proven
very successful in over three years of operation. To my
knowledge no problems have been encountered due to the
monitor/COMEX system interface.


4. JUSTIFICATION

4.1 RELIABILITY EVALUATION: The system has met and
exceeded the design criteria over the last 2 years of
operation.

4.3 OVERHEAD: Since the system is totally redundant, at
least half the cost of the communications front end is due
to reliability requirements. The reliability requirements
of the system probably did not contribute significantly to
the software design. It probably helped in the checkout
and operational phases.

4.4 APPLICABILITY: The system has general applicability
for communications and message switching systems where the
base computer facility must be IBM (for whatever
reasons). It offers significant cost savings when
compared to an equivalent all-IBM equipment configuration.
Its novel interfacing technique allows the users to
concentrate on the application program and offers
long-term savings in effort by not having a modified IBM
operating system. The system has specific applicability
to other small or moderate sized stock exchanges both U.S.
and foreign.


4.5 EXTENDABILITY: There appear to be no obvious
extensions to the system as far as capacity is concerned.
Two- or three-fold increases in throughput are possible
whereas factors of ten are out of the question. Since the
next obvious exchange automation task is either NYSE or
AMEX and the volume of message traffic for those exchanges
is staggering, COMEX most certainly has no real logical
extension for these situations. Specific experience and
techniques in dealing with automation of a stock exchange
process may have general applicability.

4.6 CRITICALITIES: A specific goal in hardware design not
to exceed the "state of the art" was imposed by PCSE to
gain assurance of reliability. This constraint caused the
selection of hardware that most assuredly is obsolete by
today's standards (e.g., bit serial TTY interface),
greatly restricting overall I/O capacity (like by a
factor of 10).

5. CONCLUSIONS

5.1 STATUS: The system is currently handling 15% of the
message switching capacity of 25,000 order
transactions per day. It is undergoing significant
modification to handle round-lot traffic, which
potentially will increase load to 50% of capacity within
the next 18 months. Studies are underway to evaluate
high-speed I/O capability.

5.2  EXPERIENCE:  Overall  system  operation  has  been  highly 
satisfactory  to  the  PCSE. 


COMEX SYSTEM - PACIFIC COAST STOCK EXCHANGE

SURVEY OF FAULT-TOLERANT COMPUTING SYSTEMS

John H. Wensley, Stanford Research Institute
Menlo Park, Ca. 94025, May 1972

1. IDENTIFICATION

1.1. NAME: SIFT (Software-Implemented Fault Tolerance)
project: design study of a fault-tolerant digital
computer

1.2. RESPONSIBILITY: SRI

1.3. SUPPORT: NASA Langley

1.4. PARTICIPANTS: J. Goldberg, K. Levitt, R. Ratner, J.
Wensley, H. Zeidler, H. Crean

1.5. START: August 1971

1.6. COMPLETION: Experimental version 1973, final design
1974

1.7. BIBLIOGRAPHY: Technical Progress Narratives 1-7;
"SIFT - Software Implemented Fault Tolerance,"
FJCC 1972


2.  MOTIVATION 

2.1. PURPOSE: Control processing in an advanced
technology transport (aircraft) including navigation,
stability augmentation, engine control, instrument blind
landings, etc.

2.2. PHYSICAL ENVIRONMENT: Airborne - the system concept
however is applicable to any environment.

2.3. COMPUTING ENVIRONMENT: Real-time

2.4. COMPUTING OBJECTIVES: Configuration scalability,
graceful degradation, transportability of concept to any
processor or memory design.

2.5. RELIABILITY OBJECTIVES: Minimum probability of
erroneous results, and of loss of computing capacity
during aircraft flight.

2.6. DYNAMIC VARIABILITY: Variable degrees of fault
tolerance for tasks of differing criticality. Ability to
trade off between computing power and fault tolerance.

2.7. PENALTIES: Worst case - human lives; intermediate -
aircraft damage; least case - need to abort flight
objectives.

2.8. CONSTRAINTS: Hardware must be designed with weight,
size and power requirements consistent with aircraft
requirements. The basic concept of the system is only
affected by the constraint that maintenance cannot be
carried out during flight.

2.9. TRADEOFFS: Computing capacity vs. reliability


3. DESCRIPTION: A system architecture in which fault
tolerance is achieved with no special fault-tolerant
hardware.

3.1. ARCHITECTURE: A multi-computer (see Fig 1)

3.1.1. CONFIGURATIONS: No constraints are present on
processor or memory design. Fault tolerance is achieved
by the restricted connection of processors and memories,
and by software control.

3.1.1.1. INTERCONNECTIVITY: Processing modules comprising
a processor and memory are connected via multiple buses.
The interconnection is designed so that processors may
only read from (and not write into) the memory of other
modules. The buses are used as alternative routes rather
than as multiple simultaneous transmission paths.
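
The read-only access rule above can be pictured with a small sketch. It is an illustration only: the class and method names are hypothetical, a dictionary stands in for a module's memory, and Python does not enforce the restriction the way the bus hardware is meant to.

```python
class ProcessingModule:
    """A processor plus its own memory; other modules may read it, not write it."""

    def __init__(self, name):
        self.name = name
        self._memory = {}

    def write_local(self, address, value):
        """Only the owning module stores into its own memory."""
        self._memory[address] = value

    def read_from(self, other, address):
        """Read a word from another module's memory over a bus (read-only)."""
        return other._memory.get(address)


a = ProcessingModule("A")
b = ProcessingModule("B")
a.write_local(0x10, 42)
value_seen_by_b = b.read_from(a, 0x10)
```

The class deliberately offers no operation by which one module stores into another's memory, so in this model a misbehaving module can publish wrong values but cannot overwrite a neighbour's state.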

3.1.1.2. RANGE: The scale of the system is not frozen in
the architectural concept. It is envisaged that a minimum
configuration would contain three processing modules and
three buses. The design does not (at present) place any
limit on the maximum configuration. Greater fault
tolerance is achieved with a large number of
low-capability units rather than with a small number of
high-capability units.


3.1.1.3. CAPABILITY: The design concept is valid over
the entire range of processor, memory and bus capability.

3.1.2. EXECUTIVE: Executive control (allocation,
scheduling, dispatching, reconfiguration, etc.) is
achieved by replicated software executive routines.

3.1.2.1. MODES: The primary operating mode is on
repetitive real-time calculations involving many loosely
connected tasks. Both multiprocessing and
multiprogramming are included.

3.1.2.2. SOFTWARE: Tasks are multiprogrammed in each
processing module. Each task for which fault tolerance is
demanded is present in more than one module. A loose
synchronization of task processing is achieved by the
system executive (which itself is replicated and loosely
synchronized). Software fault detection is carried out
between each iteration of a task before erroneous results
are used by the next iteration or other tasks.


Figure 1  System Configuration

3.2 FAULT TOLERANCE

3.2.1. FAULTS TOLERATED: The system is tolerant to faults
in any unit (processor, bus or memory). The faults may be
the erroneous result of an action (calculation,
transmission or storage) or the failure of a unit to carry
out any action.

The system handles transient and permanent faults,
treating long-term intermittent faults as permanent. The
reconfiguration procedures can bring back into service a
unit that was at one time subject to faults but has since
recovered or been repaired.

The cause of the fault (electrical, mechanical, etc.) is
not of importance; the only consideration is whether the
results of actions in replicated units agree or disagree.

Independent multiple faults can be tolerated to any degree
depending on the extent of replication of the function.
Correlated faults both in hardware and software are not
tolerated to the same extent as uncorrelated faults. The
loose synchronization of tasks assists in tolerating
faults which are correlated in time rather than function.
One-shot faults do not cause removal or reconfiguration of
units from the system. The propagation of a fault from
any unit to another can only occur if both units are
faulty.

3.2.2. FAULTS NOT TOLERATED: Multiple correlated faults
that are not detected by a voting procedure, or by
repeating the task, e.g., simultaneous identical failure
of two memory units when threefold replication is used.
Passive faults that reduce the system to a size too small
to handle the computing load.

3.2.3. TECHNIQUES: Fault detection is carried out by
replication and voting. Other fault detection methods
(hardware or software) are compatible with and can be
incorporated into the system concept. Fault correction
(or tolerance) is achieved by voting after replication in
most cases but can be supplemented by other techniques
such as repetition or roll-back. The allocation of
resources to tasks can be changed either when faulty units
are removed or when the mission demands different fault
tolerance and/or computational power.
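
Replication and voting can be illustrated with a generic majority voter. This is a sketch under stated assumptions, not the SIFT routine: the function name, the use of exact equality between results, and the exception raised on a tie are all choices made for the example.

```python
from collections import Counter


def vote(replica_results):
    """Return (majority_value, indices_of_disagreeing_replicas).

    Raises ValueError when no value is held by a strict majority.
    """
    if not replica_results:
        raise ValueError("no replica results to vote on")
    counts = Counter(replica_results)
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(replica_results):
        raise ValueError("no majority among replicas")
    suspects = [i for i, r in enumerate(replica_results) if r != value]
    return value, suspects


value, suspects = vote([17, 17, 99])    # replica 2 is outvoted
wrong_value, _ = vote([99, 99, 17])     # two identical wrong results win
```

The second call shows the case listed under 3.2.2: with threefold replication, two units that fail identically outvote the correct one, and the voter has no way to notice. The list of disagreeing replicas is the kind of information a reconfiguration procedure could use to decide which units to distrust.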

3.3. NOVELTY: Lack of need for special hardware units to
facilitate fault tolerance. Ability to trade off fault
tolerance with computing power. Applicability of the
system concept to different memory or processor designs.


3.4. INFLUENCES: The design is influenced by the need to
avoid special hardware for fault tolerance, freezing
fault tolerance techniques at design time, and designs
geared to particular size and speed computers.

3.5. HARD CORE: I don't mean anything by "hard core" in
the system described. I can imagine other system concepts
in which the term has meaning (but little utility).

4.  JUSTIFICATION 

4.1. RELIABILITY EVALUATION: By analysis, assuming
uncorrelated faults of equal probability in each part of
the system (chip, connector, cable, etc.).

4.2. COMPLETENESS OF EVALUATION: Incomplete.

4.3. OVERHEAD: Variable; typically a 3-1 cost penalty is
paid for fault tolerance.

4.4. APPLICABILITY: General; the design is applicable to
any environment.

4.5. EXTENDABILITY: Unlimited.

4.6. CRITICALITY: Multiprocessing is critical.
Multiprogramming is highly desirable (see Fig 2).

4.7. IMPLICATIONS: There are no implications for the
hardware designers of processors and memories. The buses
are constrained in the way units communicate. The
applications software must be implemented so that input
data for a program is fetched by calling a general system
routine which carries out fault detection and correction.

5.  CONCLUSIONS 

5.1. STATUS: A conceptual design of hardware, software
and fault tolerance procedures exists.

5.2. EXPERIENCE: Software design studies show that the
time and memory requirements of the fault detection and
correction routines are reasonable.

5.3. FUTURE: The projection is for an experimental
version of the system to be built.

5.4. ADVANCES: I/O units with fault tolerance
capability.

Figure 2  An Example of Task/Processor Allocation


SURVEY OF FAULT TOLERANT COMPUTING SYSTEMS

Robert K. Williams, Plessey Telecommunications Research
Ltd., Taplow Nr. Maidenhead, Berks., U.K., October 1972.

1.  IDENTIFICATION 

1.1.  NAME:  System  250 

1.2. RESPONSIBILITY: The Plessey Co. Ltd.

1.3. SUPPORT: System development is jointly supported by
The Plessey Co. Ltd. and the National Research and
Development Corporation.

1.4. PARTICIPANTS: The Plessey Co. Ltd.

1.5.  START:  January  1969 

1.6. COMPLETION: Prototype completed end of 1971.

1.7. BIBLIOGRAPHY: The following four papers are contained
in the proceedings of the International Switching
Symposium, M.I.T., Cambridge, Mass., U.S.A., 6-9 June 1972.

* D. M. England, Operating System of System 250.

* J. M. Cotton, The Operational Requirements for Future
Communication Control Processors.

* D. Halton, Hardware of the System 250 for Communication
Control.

* W.A.C. Heaatinga, Telephone Switching Based on System 250.

The following five papers are contained in the proceedings
of the I.E.R.E. Conf. on Computers - Systems and Technology,
Middlesex Hospital Med. School, London, U.K., 24-27 Oct 72.

* R. K. Williams, System 250 - Basic Concepts.

* M. J. Coodier, System 250 - Processing Philosophy.

* P. C. Venton, System 250 - Input/Output.

* R. J. teaman, System 250 - Security Philosophy.

* G. Edge, System 250 - Diagnostics.

The following four papers appear in the proceedings of the
International Conference on Computer Communication,
Washington D.C., U.S.A., 24-26 October 1972.

* D. C. Cosserat, A Capability Oriented Multi-processor
System for Real-Time Applications.

* K. H. Hamer-Hodges, Fault Resistance and Recovery Within
System 250.

* C. S. Repton, Reliability Assurance for System 250, A
Reliable Real-Time Control System.

* J. Crompton, Structure and Internal Communications of a
Telephone Control System.

2.  MOTIVATION 

2.1. PURPOSE: Stored program control of telephone and data
switching systems.

2.2. PHYSICAL ENVIRONMENT: Ground based

2.3. COMPUTING ENVIRONMENT: The system is designed to
allow flexible interaction with its environment, e.g.
locally, remotely and/or via a network.

2.4. COMPUTING OBJECTIVES: The computing objectives are not
well defined in any absolute sense. System performance
must be adequate to enable very large telephone exchanges
to be adequately controlled, yet the cost of the smallest
secure configuration should be minimised to allow economic
control of small exchanges. The system architecture should
allow easy expansion of an initial configuration by a
factor of three or more whilst the system is on-line. Such
expansion could be in terms of processing power and/or
storage capacity and/or input/output capacity or any
permutation thereof. (See also 2.8.)

2.5. RELIABILITY OBJECTIVES: The system was designed with
the aim of meeting the reliability requirements proposed by
the British Post Office for application to telephone
control equipment. These requirements were defined on a
sliding scale which related duration of a single system
failure to the maximum acceptable mean frequency of
occurrence of similar failures.

Failure Duration     Max. acceptable mean frequency
20 ms                50 per year
15 s                 12 per year
**                   1 per year
5 min                1 per 20 years
10 min               1 per 50 years

For the purposes of the above, a system failure is defined
as a fault which affects more than half of the controlled
equipment. /Note: Average duration 5 seconds/.


2.6. DYNAMIC VARIABILITY: Both performance and degree of
fault tolerance may be varied at will by simply adding or
subtracting system modules. Addition of modules
simultaneously increases both performance and reliability,
thus the question of trade-off does not arise.

2.7. PENALTIES: Faulty operation will obviously degrade
performance, which may well lead to loss of revenue and in
extreme circumstances could involve loss of life, e.g. if
emergency telephone calls fail to get through etc.

2.8. CONSTRAINTS: There are no well defined constraints
on size, weight, power, cost etc. in absolute terms. The
aim has been to produce a system which is very competitive
in terms of the above parameters with contemporary systems
but offers very much enhanced:

* Reliability and Security,

* Ease and Range of Expansion,

* Flexibility in terms of being able to tailor the hardware
and software configuration to closely match particular
requirements.

2.9. TRADEOFFS: Computing capacity & reliability vs. cost.

3.  DESCRIPTION 

3.1. ARCHITECTURE

3.1.1. CONFIGURATIONS

3.1.1.1. INTERCONNECTIVITY (See Figure): The basic
hardware constraints on system interconnectivity (aside
from any additional constraints imposed by software) are
described below.

Each processor unit has its own dedicated communications
bus for communicating with store or the input/output
network. No processors will be directly connected together
under normal circumstances although this is allowed (via a
special interface) for fault diagnosis purposes only. Any
processor can access any storage location and any part of
the input/output system. Store modules are connected to
all processor buses via multiport access units.

In systems which contain more than two processors, access
to the input/output system is achieved via two multiport
Bus Multiplexors which multiplex three or more Processor
Buses onto two Peripheral Buses (one per multiplexor).

Fast peripheral devices are connected directly to both
peripheral buses via two-port Parallel Interface Units.

All data transfers between the above units take place in
24-bit parallel mode.

Slow speed and/or low activity peripheral devices are
connected to the system via a serial communications medium
in which all data transfers take place in serial bit form.
The Serial Medium is interfaced onto the Peripheral Buses
via specialised Parallel Interface Units known as
Serial-Parallel Adaptors. Each adaptor has two ports and
is connected to both Peripheral Buses. Peripherals are
interfaced onto the Serial Medium via two-port Serial
Interface Units, each port being connected to a different
Serial-Parallel Adaptor via a network of Data Switches.

The pathway between a Serial-Parallel Adaptor and a Serial
Interface Unit normally passes through a 64-port Primary
Data Switch (of which there is one per Serial-Parallel
Adaptor) and then through a 16-port Secondary Data Switch
to the appropriate Serial Interface Unit. The Secondary
Data Switch may sometimes be omitted.

Large systems may contain several Serial Media each being
connected onto two Peripheral Buses via two Serial-Parallel
Adaptors. If necessary several pairs of Peripheral Buses
could also be provided via several pairs of Multiplexors.

If there are no more than two processors in a system,
Multiplexors are unnecessary and Parallel Interface Units
and Serial-Parallel Adaptors may be connected directly to
the processor buses.

3.1.1.2. RANGE: There are no well defined upper limits on
the numbers of processors and/or stores possible in a
system, but present estimates indicate that systems
containing up to 16 processors and perhaps 20-30 store
modules are feasible. Each store module could contain up
to 64K of 24-bit words. The smallest sensible system
currently envisaged would contain a single processor and a
single store module of 16K or 24K capacity.

3.1.1.3. CAPABILITY: Based on a method of power assessment
which has been developed specifically for the telephone
switching application, a single processor system turns out
to be about one third or one half (depending on the type of
store used) as powerful as an IBM 360/65. The maximum
possible number of fixed point additions per second for a
single PP250 processor lies between about 500,000 and
900,000 depending on the type of store used (viz. 850 ns
core or 300 ns plated wire) and on whether the addition is
a store reference or register to register operation.

3.1.2.1. MODES: The system is a multi-CPU system with all
CPUs being asynchronous identical units of equal status.

All Operating System modules are re-entrant and thus may be
executed by several processors simultaneously and
independently. The Operating System will normally be
entered by a subroutine call but can also be entered as a
result of a program trap or as the result of a timer
maturing within a CPU.

The System is multiprogrammable with each processor being
run on a time-sharing basis. Multi-processing is a standard
feature of the System and processes can be run
independently from or in controlled co-operation with other
processes.

In System 250 a process is a dynamic entity and is defined
as the execution of program code on a particular set of
input data. Because of the protection afforded by
Capabilities, many processes can safely share a particular
block of program code simultaneously, but each will execute
it on a different set of data. A process may only be run
on one processor at a time but may in general run on
several different processors consecutively.

Since it is data and not code which distinguishes one
process from another, processes are allowed to cross the
conceptual boundary between Operating System and user
programs in just the same way as they would cross the
boundaries between individual user programs. This presents
no special difficulties since the hardware Capability
mechanism, which monitors and constrains the switching of
control between programs, makes no distinction between
Operating System and user programs.

3.1.2.2. SOFTWARE: The System 250 software organization is
described in D.M. England's paper presented at the
International Switching Symposium, June 1972. (See 1.7.)

3.2.  FAULT  TOLERANCE 

3.2.1. FAULTS TOLERATED: The System 250 architecture
allows at least one redundant module of each type to be
provided in a system. Thus the system will carry on
operating in the face of hardware faults provided that at
least one module of each type remains fault-free. Faults
caused by software errors will normally only occur
(assuming programs have been properly debugged) when rather
rare combinations of data and/or timing are encountered.
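
The condition for continued operation stated above, that at least one module of each type remains fault-free, can be written as a simple check. The module type names and counts here are hypothetical and serve only to illustrate the rule.

```python
def system_operational(fault_free_modules):
    """True if every module type still has at least one fault-free module."""
    return all(count >= 1 for count in fault_free_modules.values())


# Hypothetical configuration: counts of fault-free modules per type.
status = {"processor": 2, "store": 1, "bus multiplexor": 2}
ok_before = system_operational(status)

status["store"] -= 1   # the last fault-free store module fails
ok_after = system_operational(status)
```

The check says nothing about throughput: a system that passes it may still be running at reduced capacity until the failed modules are repaired.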

The software recovery procedures outlined in 3.2.3 below
allow the unusual circumstances surrounding the fault to be
avoided by employing increasingly powerful (and hence more
disruptive) recovery actions until the fault no longer
manifests itself.
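
The escalation strategy can be sketched as a loop over recovery actions ordered from least to most disruptive. This is an illustration only: the action names are hypothetical labels for the example, and the actual System 250 recovery procedures are those of section 3.2.3, which this sketch does not reproduce.

```python
def recover(actions, fault_still_present):
    """Apply actions in order of increasing disruption; return the one that worked.

    Returns None if the fault persists after every action has been tried.
    """
    for name, action in actions:
        action()
        if not fault_still_present():
            return name
    return None


state = {"fault": True, "log": []}


def resume_process():
    state["log"].append("resume at fault point")


def restart_process():
    state["log"].append("restart from beginning")


def reload_system():
    state["log"].append("reload whole system")
    state["fault"] = False   # assumed: only the full reload clears this fault


ladder = [
    ("resume at fault point", resume_process),
    ("restart from beginning", restart_process),
    ("reload whole system", reload_system),
]
cleared_by = recover(ladder, lambda: state["fault"])
```

Ordering the ladder by disruption means the most costly action is paid for only when the cheaper ones have already failed to clear the fault.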

It is recognised that a number of obscure software errors
are always likely to be present in the system but since the
circumstances which cause system faults to develop as a
result of these errors are by definition rarely
encountered, they will not in general cause unacceptable
service disruption.

The effect of a hardware or software fault on the external
environment will be to cause one or more of the following:

* If the fault disables a store or processor, a permanent
drop in the throughput of the system will result, at least
until the necessary maintenance action is undertaken.

* If the fault is elsewhere a temporary fall in the
throughput capacity of the System will occur while test and
restart or reload measures are undertaken. The magnitude
and duration of this fall depends on the type of fault, the
status of the System (i.e. with regard to work load) and
the hardware and software configuration of the System.

* Depending on the nature of the fault, it may be possible
to restart affected processes at the point at which the
fault was detected or it may be necessary to restart
processes from the beginning. In the telecommunications
control application the former action should cause no loss
of calls whereas the latter action may mean the loss of
some or all of the calls being handled by the affected
processes. In the worst case the whole system is reloaded
from backing Store and all read/write data areas are
cleared, resulting in the loss of all calls being handled by
the system. This case should be very rarely encountered.


* Faults in Serial or Parallel Interface Units will
naturally disable the peripherals to which they are
attached. These units are allocated on a one per
peripheral basis; thus a single fault will only affect one
peripheral device. All communication paths between
processor and peripheral interface units are duplicated;
thus a fault in one or more of the units on only one of the
communication paths will not affect system operation.

3.2.2. FAULTS NOT TOLERATED: It is anticipated that the
only fault conditions not tolerated by the System (i.e.
from which the system is unable to recover automatically)
involve at least two simultaneous faults which

* Disable at least two system hardware modules of the same
type so that no fault free modules of this type remain, or

* Override the Capability mechanism and corrupt ALL copies
of a vital software area before the fault is detected.

It is believed that the chances of either of the above
happening are acceptably remote. In any specific
application the chances of such situations arising can
always be reduced below any finite limit by suitably
increasing redundancy of hardware and software modules.
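
The statement that the chance of losing every module of one type
can be pushed below any finite limit can be illustrated with a short
calculation. The sketch below is not taken from the System 250
evaluation: it assumes that modules of one type fail independently,
each with the same probability q over the period of interest, and
the function name and parameters are hypothetical.

```python
import math


def modules_needed(q: float, limit: float) -> int:
    """Smallest number n of identical modules such that q**n <= limit.

    Assumption (not from the report): modules fail independently, each
    with probability q, so the probability that no fault-free module
    of the type remains is q**n.
    """
    if not (0.0 < q < 1.0) or not (0.0 < limit < 1.0):
        raise ValueError("q and limit must lie strictly between 0 and 1")
    n = max(1, math.ceil(math.log(limit) / math.log(q)))
    # Correct for possible floating-point rounding in the logarithms.
    while q ** n > limit:
        n += 1
    while n > 1 and q ** (n - 1) <= limit:
        n -= 1
    return n
```

Under these assumptions the required replication grows only
logarithmically as the limit is tightened. Correlated faults (for
example a single failure that spreads to several modules) would
break the independence assumption.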

3.2.3. TECHNIQUES: The System 250 architecture allows a
fault to be tolerated in any single hardware module by
providing redundant modules of each type in a secure system
configuration. Faults will normally be detected either by
one of an extensive range of hardware fault detection
mechanisms provided in each PP250 processor unit (e.g.
capability checks, parity checks, microprogram checks,
etc.) or by background test routines or by consistency
checks written into the Operating System and applications
software.

Faults detected by hardware automatically cause the
processor concerned to enter a self-test routine with very
limited access to system resources. Processors which
successfully emerge from the self-test can apply to rejoin
the system, the application normally being dealt with by
fault recovery software being run on a good processor.
Processors which have a bad history of faults may be
refused permission to rejoin the system and forced to
endlessly repeat the self-test procedure until maintenance
action is undertaken.

If a hardware fault is traced to a module other than a
processor, the fault recovery software causes the faulty
module to be effectively isolated from the system awaiting
maintenance action. If (as may be the case for certain
intermittent hardware faults) the fault cannot be traced to
a particular module, the recovery software will as a last
resort cause the system to be reconfigured leaving out
modules on a trial basis, until a fault free configuration
is achieved.
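
The last-resort trial reconfiguration can be pictured as a search
over configurations with some modules left out. The following is a
minimal sketch only: the report does not give the order in which
modules are tried, so the "fewest modules left out first" order, the
`works` predicate and all names are assumptions made for
illustration.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Optional


def trial_reconfigure(
    modules: Iterable[str],
    works: Callable[[FrozenSet[str]], bool],
    max_left_out: int = 1,
) -> Optional[FrozenSet[str]]:
    """Leave out modules on a trial basis until a configuration works.

    `works` is a hypothetical test reporting whether the system runs
    fault free with the given set of modules. Configurations that
    omit fewer modules are tried first. Returns the first working
    configuration, or None if none is found.
    """
    all_modules = frozenset(modules)
    for k in range(1, max_left_out + 1):
        for left_out in combinations(sorted(all_modules), k):
            candidate = all_modules - frozenset(left_out)
            if works(candidate):
                return candidate
    return None
```

In a real system the `works` test would also have to confirm that at
least one module of each type remains in the candidate
configuration.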


EXAMPLE  OF  MEDIUM  INSTALLATION 


Software recovery procedures are arranged in a hierarchical
structure starting with procedures which attempt to clear
corrupted data and restart failed processes and, if
necessary, working up through several stages of successively
more disruptive recovery measures, each involving reloading
code and data areas from copies on backing store.
Eventually, if all else fails, this culminates in clearing
all code and data areas from fast store (except certain
areas containing replicated copies of the basic recovery
programs), clearing all read/write data from backing store
and reloading all programs and read/only data from backing
store.
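
The hierarchy just described amounts to trying recovery measures
in order of increasing disruption and stopping at the first one that
clears the fault. A minimal sketch follows; the step names and the
callable interface are hypothetical and are not the System 250
recovery software.

```python
from typing import Callable, Iterable, Optional, Tuple

# A recovery step is a (name, action) pair. The action returns True
# when the fault no longer manifests itself after the step.
RecoveryStep = Tuple[str, Callable[[], bool]]


def escalate(steps: Iterable[RecoveryStep]) -> Optional[str]:
    """Apply recovery steps in order of increasing disruption.

    Returns the name of the first step that clears the fault, or None
    if every step, including the final full reload, fails.
    """
    for name, action in steps:
        if action():
            return name
    return None


if __name__ == "__main__":
    # Hypothetical ladder; each lambda stands in for a real procedure.
    ladder = [
        ("clear corrupted data and restart processes", lambda: False),
        ("reload code and data areas from backing store", lambda: True),
        ("clear fast store and reload everything", lambda: True),
    ]
    print(escalate(ladder))
```

Because the loop stops at the first success, the more disruptive
measures are never invoked when a milder one suffices.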

Recovery from both hardware and software faults involves the
implementation of a series of increasingly disruptive
actions until the fault is cleared. When one processor in
a multi-processor system becomes faulty and enters its
self-test routine, its absence is soon noticed by a System
Monitor process which inspects the status of the system and
is scheduled to run at regular intervals. This process
initiates a high priority recovery process which is then
placed in a 'ready to run' list and in due course is
scheduled to run on a good processor just like any other
process.

3.3. NOVELTY: The most unusual design features of the
system are as follows:

* The system has been designed to be fault tolerant to a
degree hitherto unheard of in commercially available
computer systems.

* System power and storage may be expanded independently
simply by adding further processors or storage units.

* Additional "redundant" processors and stores added to a
system to enhance its reliability also perform useful work
and thus increase the computing capacity of the system.
These modules are therefore not redundant in the same sense
as redundant modules in many other systems which purely
perform a backup function and do not usefully contribute to
system performance in the absence of a fault.

* System hardware and/or software modules can be inserted,
removed and/or modified whilst the system is on-line with
no consequent loss of service.

* Data and program security is preserved by a hardware
implemented Capability mechanism which not only defines the
areas of store or the input/output system which are
accessible by a program, but also defines the type of
access allowed. This is a particularly important feature
when the sharing of store is allowed between processes.

* There is no privileged mode of processor operation.
Operating System programs are subject to the same security
restrictions (enforced by Capabilities) as user programs.

* In the event of a fault being detected in hardware, a
hierarchy of automatic recovery procedures is entered with,
if necessary, successively more disruptive measures being
taken in order to recover a working System. This leads to
a trial reconfiguration procedure if all else fails.

* Diagnosis of a faulty hardware module may be carried out
on-line with no increased risk to the rest of the system.

* The input/output system is designed to be very flexible
in its configurability and in particular allows very large
numbers of low activity peripheral devices to be
efficiently dealt with.

* No external interrupts are allowed into the processors
(for security reasons) and all input/output is handled via
polling procedures.

* Virtual memory is used in a real-time context.
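
The Capability item in the list above says that a capability
defines both the area of store a program may reach and the type of
access allowed. The sketch below illustrates that idea in software
only. The PP250 capability format is not given in this description,
so the field names, the three access types and the bounds rule are
all assumptions.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Access(Flag):
    """Hypothetical access types; the real set is not specified here."""
    READ = auto()
    WRITE = auto()
    EXECUTE = auto()


@dataclass(frozen=True)
class Capability:
    base: int       # first address of the accessible block (assumed)
    size: int       # number of words in the block (assumed)
    rights: Access  # access types permitted on the block


def check_access(cap: Capability, address: int, wanted: Access) -> bool:
    """Allow the reference only if it is in bounds and of a permitted type."""
    in_bounds = cap.base <= address < cap.base + cap.size
    permitted = (cap.rights & wanted) == wanted
    return in_bounds and permitted
```

Since the same check applies to every program, Operating System
code holding a read-only capability for a block would be refused a
write to it just as a user program would.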

3.4. INFLUENCES: The use of Capabilities to structure and
protect the System 250 software has been significantly
influenced by the research work of Dr. R. S. Fabry carried
out at the University of Chicago on "List Structured
Addressing" and also by the ideas and advice of Professor
M. V. Wilkes of the University of Cambridge.

3.5. HARDCORE: The PP250 processor has been designed such
that the conventional conception of hardcore (i.e. that
single portion of the system which must work in order to
make the system work or make diagnosis possible) has been
avoided. Replication of all vital system hardware and
software modules ensures that no single module failure can
bring the system down.

4.  JUSTIFICATION 

4.1. RELIABILITY EVALUATION: System reliability
calculations have been carried out using estimated
M.T.B.F.'s and M.T.T.R.'s of system hardware modules. These
in turn were calculated from measured failure rates of
individual hardware components. The processor self-test
program has been tested using a logic level simulation
program for the processor into which a large number and
variety of faults were injected.
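
For orientation, the usual way M.T.B.F. and M.T.T.R. figures are
combined is sketched below. This is the standard steady-state
availability formula, not the calculation actually used for System
250, and the replicated case assumes independent failures and
repairs. The function names are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one module: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def replicated_unavailability(
    mtbf_hours: float, mttr_hours: float, copies: int = 2
) -> float:
    """Probability that all copies of a module type are down at once.

    Assumption: the copies fail and are repaired independently.
    """
    return (1.0 - availability(mtbf_hours, mttr_hours)) ** copies
```

With illustrative figures of 999 hours M.T.B.F. and 1 hour M.T.T.R.
(not taken from the report), one module is available 99.9% of the
time and a duplicated pair is simultaneously down with probability
of about one in a million.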


4.2. COMPLETENESS OF EVALUATION: The design evaluation is
expected to continue for some considerable time (if indeed
it ever stops) especially in the light of running Operating
System (well under way) and applications programs.

4.3. OVERHEAD: This depends very much on the required
system power and the required level of reliability. In the
smallest secure configuration in which all modules are
duplicated one could argue that more than 50% of the cost
is devoted to the provision of fault tolerance. However,
even in this case the extra processor and extra store make
real contributions to system performance and so cushion the
system against instantaneous peaks. See also 3.3, Item 3.

In large systems the ratio of essential to "redundant"
hardware may be greater than 5 to 1 depending on system
size and the desired level of reliability.

The proportion of fast storage devoted to fault recovery
software in a typical telecommunications application will
probably be not more than 25% and could be a lot less in a
large system. Probably more than 50% of backing storage,
however, is present in order to achieve fault tolerance
since backing storage containing copies of all system
software must be duplicated for reliability.

It is difficult to assess the cost overhead associated with
the use of capabilities since their usefulness extends far
beyond just fault protection.

4.4. APPLICABILITY: The system is applicable to almost any
real-time control application but particularly those with a
good reliability and expansion potential.

4.5. EXTENDABILITY: This question cannot be satisfactorily
answered at this stage as it requires a much more complete
evaluation of the present system design.

4.6. CRITICALITIES: Both multi-programming and
multiprocessing are fundamental to the achievement of the
system design aims. The choice of all except peripheral
and storage hardware is critical as all other system
hardware modules have built in features which are closely
matched to the overall system requirements. It is of
course possible to use modules which offer the same
facilities and interfaces but differ internally in detailed
implementation.

4.7. IMPLICATIONS: Main requirement on hardware system
designers is that design should not allow single hardware
failures to generate further failures and thus spread to
several modules. Software designers are responsible for
inserting consistency checks, etc., in their own programs.
They should also write routines which enable execution of
their programs to be restarted following a detected fault.
This responsibility also extends to user programmers.
Maintenance action should be designed such that it can be
carried out on-line.

5.  CONCLUSIONS 

5.1. STATUS: Several pre-production multiprocessor systems
working and under evaluation. Production expected to
commence in middle of 1973. First delivered production
system expected to be fully operational in September 1974.

5.2. EXPERIENCE: The basic system philosophy is a proven
success. Planned development targets are being
consistently achieved. Some minor modifications are being
introduced as a result of evaluation of a number of
pre-production systems.

5.3. FUTURE: As a general policy system implementation is
continually under critical review in the light of operating
experience and advances in technology. The ability to
allow system evolution is essential in applications such as
telecommunications control where the system is designed to
operate continuously for perhaps several decades.

5.4. ADVANCES: In respect of system architecture, so many
novel features are incorporated in the present System 250
design that these require a more complete evaluation
before all of the important implications become apparent.
It is therefore not possible at this stage to indicate
architectural advances which are obviously desirable.

Obviously desirable advances in hardware technology include
increased reliability of peripherals and minimum use of
moving part mechanical techniques in particular.



APPENDIX 3


DETAILED  CONSIDERATIONS  OF  MEMORY  RECONFIGURATION 


This  appendix  considers  detailed  aspects  of  reconfiguration  of  memory 
systems,  not  only  of  the  memory  circuits  themselves,  but  of  other 
components  such  as  data  busses,  address  decoders,  etc. 

The  memory  is  assumed  to  be  built  from  a  number  of  units  —  for  example 
LSI  chips,  each  having  the  same  memory  capacity.  When  a  fault  is 
detected,  some  of  these  units  are  discarded,  and  either  they  are 
replaced  by  a  similar  number  of  spare  units,  or  the  system  now  has 
reduced  memory  capacity.  The  terms  used  are  as  defined  in  section  6. 


A3.1. MEMORY RECONFIGURATION BY BLOCK REPLACEMENT

We restate the reliability estimates previously given in section 4.2.

    P[\geq w' : w] = \sum_{i=0}^{\lfloor (w-w')/y \rfloor} p_i      (A1)

where

    p_i = \binom{u}{i} p^i (1-p)^{u-i}                              (A2)

and

    u = w/y                                                         (A3)

A3.2. THE USE OF CODING WITH BLOCK REPLACEMENT

The questions addressed in this section are: What is the optimum number
b of bits per byte? and, given the probability p of chip failure, what
is the value for P[\geq w' : w]?

If w' words are required to be still available after t blocks have
become faulty, then w = (w'+yt). The number of chips required is
therefore

    N = (k+r)(w'+yt)/(yb)                                           (A4)


Let N(b,t) be the number of chips required for varying b and t for the
case w = 16k, yb = 4k, n = 32. Then a few important values of N are

N(1, t) = 152 + 38t
N(2, t) = 152 + 19t
N(4, t) = 160 + 10t
N(8, t) = 192 + 6t.
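
The four expressions above can be checked against the chip-count
formula N = (k+r)(w'+yt)/(yb). The sketch below does so under stated
assumptions: 16k is taken as 16384 required words w', a chip holds
yb = 4096 bits, and the code word lengths k+r (38, 38, 40 and 48
bits for b = 1, 2, 4 and 8) are inferred from the coefficients in
the list rather than quoted from the report.

```python
# Code word length k+r for each byte width b. These values are
# inferred from the listed coefficients and are assumptions.
CODE_WORD_BITS = {1: 38, 2: 38, 4: 40, 8: 48}


def chips_required(b: int, t: int, w_prime: int = 16384,
                   chip_bits: int = 4096) -> float:
    """Evaluate N = (k+r)(w'+yt)/(yb) for byte width b and t spare blocks."""
    y = chip_bits // b  # words per chip, since yb = chip_bits
    k_plus_r = CODE_WORD_BITS[b]
    return k_plus_r * (w_prime + y * t) / (y * b)


if __name__ == "__main__":
    for b in (1, 2, 4, 8):
        print(b, [chips_required(b, t) for t in range(3)])
```

The constant term is the chip count with no spare blocks and the
coefficient of t is the number of chips in one block, (k+r)/b.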

There are good reasons for b to be a power of 2, although codes of
course exist for other values of b (see Section 4.1). The reliability
can be expressed as

    P[\geq w' : w] = \sum_{i=0}^{(w-w')/y} \binom{w/y}{i} p_f^i (1-p_f)^{w/y-i}      (A5)

where

    p_f = 1 - (1-p)^{(k+r)/b},                                      (A6)

whence

    P_s = P[\geq w' : w]
        = \sum_{i=0}^{(w-w')/y} \binom{w/y}{i} (1-(1-p)^{(k+r)/b})^i (1-p)^{(k+r)(w/y-i)/b}      (A7)

Figures A3.1(a) to (e) show P_s or P_f = 1 - P_s, i.e., the probability
of success or failure, for p = 10^{-n}, n = 1...5, and k = 32,
b = 1, 2, 4, 8 and w = 16k.
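
The reliability expression of this section is a binomial tail: a
block of (k+r)/b chips is lost if any of its chips fails, and the
memory survives while no more than (w-w')/y of its w/y blocks are
lost. A minimal numerical sketch follows. It assumes independent
chip failures with probability p and 4096-bit chips; the function
and parameter names are hypothetical.

```python
from math import comb


def block_failure_probability(p: float, k_plus_r: int, b: int) -> float:
    """Probability that at least one of the (k+r)/b chips in a block fails."""
    return 1.0 - (1.0 - p) ** (k_plus_r / b)


def survival_probability(p: float, k_plus_r: int, b: int, w: int,
                         w_prime: int, chip_bits: int = 4096) -> float:
    """Probability that at least w_prime of the w words remain available."""
    y = chip_bits // b            # words per block, since yb = chip_bits
    blocks = w // y               # u = w/y
    spare = (w - w_prime) // y    # block failures that can be absorbed
    pf = block_failure_probability(p, k_plus_r, b)
    return sum(
        comb(blocks, i) * pf ** i * (1.0 - pf) ** (blocks - i)
        for i in range(spare + 1)
    )
```

With no spare blocks the sum reduces to its first term, the
probability that every chip in the memory is good.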


A3.3. RECONFIGURATION BY CHIP REPLACEMENT
A3.3.1. THE MEMORY MODEL

The basic model is depicted in Figure A3.2. It consists of a decoder
(for high-order address bits), an input switching network, an output
switching network, and a set of memory chips. Each memory chip acts as
a y-word by b-bit RAM. The following parameters describe the
configuration of the main memory:


d = number of bytes per n-bit word (typically 4 to 16), d = n/b

z = number of blocks of memory (typically 4 to 2048), where a block
consists of y words

s = number of spare chips (typically small relative to total memory
size)

m = total number of chips = (zd+s)

t = number of faulty chips to be tolerated.


Fig. A3.2 General model for reconfigurable memory.

From the standpoint of maximal use of spare chips, s = t is desirable;
however, as seen below, some benefits accrue from having s > t in terms
of switching-network regularity and simplicity, and ease of switch
set-up. In this model, t is the guaranteed fault-tolerance, i.e., the
memory can be configured into z blocks of d chips in the presence of all
combinations of t or fewer memory chip failures. In one of the examples
below, s > t. In this case there are sufficient spares to correct more
than t failures, but switching-network limitations may prevent this
extended correction. However, an analysis shows that the number of such
offensive combinations is vanishingly small, and that certain economies
in switching-network complexity are attained by keeping the guaranteed
correction below the number of spares. Thus, in any event the value t
itself is not sufficient to evaluate the reliability of the memory.

The memory function is to be configured out of a set of zd operative
chips. The block selection is accomplished by the decoder, which
selects one appropriate control line, under control of the log2 z
higher-order address bits. The lower-order address bits are delivered
to all memory chips, with the word selection accomplished by a decoder
within each chip. The appropriate configuration is achieved by setting
up the input and output switching-network pairs (SNP). Note that the
connection established by the SNPs needs to be modified only when the
memory is reconfigured.

For  most  of  this  section,  only  single-level  incomplete  cross-bar  arrays 
are  considered.  Note  that  in  contrast  with  the  telephone  cross-bar 
arrays,  the  switching  networks  for  the  memory  organization  require 
switches  at  comparatively  few  cross-points. 

As a better illustration of the role of the SNPs, consider a simple
example for which z = 4, d = 3, s = 6, m = 18, t = 5. Figure A3.3
displays one possible set-up of the switching networks to accommodate
the indicated faulty chips. Each utilized chip is identified according
to its place in memory; that is, for a chip at (i,j), "i" signifies the
block and "j" signifies the byte. The activation of a particular block
of d chips is accomplished by activating the appropriate control line.
This activation signal is transferred through the input switching
network to a unique set of d chips. The memory word emerging from the
d chips is transferred to a unique set of d output data lines by the
output switching network.

As noted in the next section, this example illustrates a nonseparable
switching network-pair, that is, an SNP for which the set-up of the
input and output networks must be accomplished together. For a
separable network pair the set-up of one of the two networks can be
done first in its entirety, independently of the other.

Fig. A3.3 An example chip reconfigurable memory.


Before embarking on the details of the synthesis procedures for these
SNPs, it is worthwhile to indicate the possible benefits of this
organization, as compared with other fault-tolerant memory
organizations. Consider a modest-sized memory requirement of 32
kilowords, each 32 bits long. Such a requirement can be achieved with
16 blocks, each of which contains 16 2-bit-wide chips, for a total of
256 chips. Assuming a chip failure probability of 10^-6 per hour, in a
mission of five years ten failures might be expected. For the
organization discussed in this section, a tolerance of 10 failures
requires a redundancy of 10 chips, or under 4% redundancy. This can be
contrasted with a memory system wherein an entire block is replaced upon
the occurrence of any chip failure within the block. For this latter
system to achieve comparable reliability, a redundancy of about 50% is
required. Two comments are in order here. First, the lower redundancy
measure is meaningful only if the switching overhead is small, a
situation that we will now show to be the case. Second, chip
replacement becomes more favorable as b is decreased and y increased
(with yb held constant). Moreover, if error detecting and correcting
techniques are used in addition to replacement, then the smaller byte
sizes are preferable from the standpoint of lower code redundancy (and
simpler decoding circuitry).
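
The figures in the comparison above follow from simple arithmetic,
sketched here. The per-chip failure rate of 10^-6 per hour is an
assumption: it is the power of ten that gives roughly the ten
expected failures quoted for 256 chips over five years. The function
names are hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap days


def expected_failures(chips: int, rate_per_hour: float,
                      years: float) -> float:
    """Expected number of chip failures over the mission, with no repair."""
    return chips * rate_per_hour * years * HOURS_PER_YEAR


def redundancy_fraction(spares: int, chips: int) -> float:
    """Spare chips as a fraction of the working chips."""
    return spares / chips


if __name__ == "__main__":
    chips = 16 * 16  # 16 blocks of 16 two-bit-wide chips
    print(expected_failures(chips, 1e-6, 5))
    print(redundancy_fraction(10, chips))
```

Ten spares for 256 working chips is about 3.9%, which is the "under
4%" figure in the text.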

A3.3.2. SWITCHING NETWORK SYNTHESIS TECHNIQUES

In  this  section  we  are  primarily  concerned  with  establishing  conditions 
for  the  existence  of  suitable  single-level  cross-bar  switching  networks. 
The  last  subsection  below  deviates  from  this  single-level  formulation, 
to  indicate  a  less  costly  multi-level  network  that  handles  large  values 
of  t. 

It will be convenient to view the input network as described by the z by
m matrix SI, and the output network, by the d by m matrix SO. A "1" in
a particular location (e,f) of the matrix corresponds to a switch in row
e and column f of the network. The following theorem gives necessary
and sufficient conditions for the matrices SI and SO such that the SNP
is capable of reconfiguring the memory in the presence of any t or fewer
failed memory chips.

THEOREM 1: For the single-level incomplete cross-bar input and output
switching networks, there exists a setting of the switches such that in
the presence of t or fewer memory chip failures, the operative chips can
be configured into an array of z rows by d columns, as long as the union
of each combination of i rows of SI, i = 1, 2, ..., z, and the union of
each combination of j rows of SO, j = 1, 2, ..., d, overlap in at least
ij+t places.
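
The condition of Theorem 1, as stated, can be tested mechanically
for small networks. The sketch below checks every combination of
rows by brute force, so its cost grows exponentially with z and d
and it is meant only for small examples. The function names and the
list-of-lists representation of SI and SO are assumptions.

```python
from itertools import combinations
from typing import Sequence, Set

Matrix = Sequence[Sequence[int]]


def row_union(matrix: Matrix, rows: Sequence[int]) -> Set[int]:
    """Columns that hold a 1 in at least one of the given rows."""
    return {c for r in rows for c, bit in enumerate(matrix[r]) if bit}


def satisfies_theorem_1(si: Matrix, so: Matrix, t: int) -> bool:
    """Check the stated overlap condition for every choice of rows.

    For each set of i rows of SI and each set of j rows of SO, the
    two row unions must share at least i*j + t columns.
    """
    z, d = len(si), len(so)
    for i in range(1, z + 1):
        for in_rows in combinations(range(z), i):
            in_cols = row_union(si, in_rows)
            for j in range(1, d + 1):
                for out_rows in combinations(range(d), j):
                    overlap = len(in_cols & row_union(so, out_rows))
                    if overlap < i * j + t:
                        return False
    return True
```

For a complete cross-bar (all entries 1) every overlap equals m, so
the condition reduces to m being at least zd+t, that is, at least t
spare chips.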

The proof is an extension of the Diversity Theorem (Ore 63), which gives
necessary and sufficient conditions for the assignment of workers to
jobs.

The necessity part is obvious since for some set of i rows of the input
network and j rows of the output network, there must be ij paths when
the networks are configured. Since up to t chip faults are to be
tolerated, wherein each chip failure disables a path, and since a path
corresponds to the appearance of 1's in a column of SI and SO, the
necessity part is seen. The sufficiency part will be proven by strong
induction on i,j.


(a) The sufficiency part is trivially true for i = j = 1, since an overlap
of t+1 places between a row of SI and a row of SO guarantees at least
one good path for t or fewer failures.


(b) Define an ordered row-pair (α, β) as consisting of row α of SI
and row β of SO. Define an N-1 ordered row-pair set (or simply N-1 set
for short) as a set of N-1 such ordered row-pairs. The intention here
is that if (α, β) is in an ordered row-pair set, then there exists a
path between row α of the input network and row β of the output network.
Now assume that if the conditions of Theorem 1 are satisfied for all N-1
ordered row-pair sets, then an appropriate setting of the switches can
be achieved to establish paths (α, β) for all (α, β) contained in the set.

Note that the theorem condition, abstracted for the N-1 ordered row-pair
set, is that

$$\Bigl|\,\bigcup_{\forall(\alpha,\beta)\ \text{contained in}\ M\ \text{subset}} P_{\alpha,\beta}\,\Bigr| \;\ge\; N-1+t \tag{A8}$$

where P_{α,β} is the set of overlap positions between row α of SI and row β
of SO.


By |∪ (overlap between row α of SI and row β of SO)|, with the union
taken over all (α, β), we mean the number of distinct columns b for
which there is a "1" in position (α, b) of SI and (β, b) of SO. The
"union" operation signifies that we count a column only once no matter
how many times it appears because of distinct (α, β).

(c) As the induction step, we will show that given condition (b) above
and the premise of the theorem for all N-1 ordered row-pair sets, then
the theorem is true for all N ordered row-pair sets. There are two
cases to consider. In the first case assume that for a particular set
of t or fewer chips, all M subsets, M < N, of the N ordered row-pair set
satisfy the conditions of the theorem with room to spare. That is:


$$\Bigl|\,\bigcup_{\forall(\alpha,\beta)\ \text{contained in}\ M\ \text{subset}} P_{\alpha,\beta}\,\Bigr| \;>\; M \tag{A9}$$


Then for any (α, β) in the N ordered row-pair set, make an arbitrary
switch setting to establish the path between α and β. After removing
(α, β), what is left is an N-1 ordered row-pair set, and this set
satisfies the conditions of the Theorem as in (b) above, thus
establishing the induction step. In order to see that the theorem
conditions are satisfied after removing (α, β), note that any N-1
subset left will have lost a maximum of "1" from the summation of
overlaps, namely that corresponding to the column used for the path
(α, β). Thus after removing the t or fewer columns in error and the
column caused by the "removed" (α, β), we find


$$\Bigl|\,\bigcup_{\forall(\alpha,\beta)\ \text{in}\ N-1\ \text{subset}} P_{\alpha,\beta}\,\Bigr| \;\ge\; N-1 \tag{A10}$$


In  the  second  case,  assume  that  for  a  particular  set  of  t  or  fewer  chip 
failures,  after  removing  all  paths  through  the  failed  chips,  there  is  at 
least  one  subset  that  satisfies  the  theorem  conditions  exactly,  that  is: 


$$\Bigl|\,\bigcup_{\forall(\alpha,\beta)\ \text{in}\ M\ \text{subset}} P_{\alpha,\beta}\,\Bigr| \;=\; M \tag{A11}$$


If we then assign the appropriate paths for all (α, β) contained in the
M subset, which we know we can do by virtue of (b) above, then for the
(α, β) ordered row-pairs in the complement set, we have


$$\Bigl|\,\bigcup_{\forall(\alpha,\beta)\ \text{in}\ N-M\ \text{subset}} P_{\alpha,\beta}\,\Bigr| \;\ge\; N-M \tag{A12}$$


since  the  entire  N  set  satisfied  the  theorem  conditions.  Thus  by 
virtue  of  (b),  path  assignments  can  be  made  for  the  N-M  subset.  The 
theorem  is  then  established  for  an  arbitrary  N  ordered  row-pair  set,  so 
that  it  is  certainly  satisfied  for  a  set  composed  of  bd  elements, 
namely,  b  rows  by  d  rows. 

In a subsection below we illustrate SNPs that satisfy the conditions of
this  theorem.  The  one  disadvantage  of  these  nonseparable  SNPs  is  that 
the  switch-setting  algorithm  must  deal  with  both  the  input  and  output 
network  simultaneously.  The  situation  is  improved  with  the  separable 
networks  defined  next. 

DEFINITION:  A  switching  network  pair  is  SEPARABLE  with  respect  to  the 
input  network  if  the  switches  can  be  set  to  achieve  the  configuration  of 
the  memory  into  z  rows  and  d  columns,  in  the  presence  of  t  or  fewer 
failures,  and  if  the  appropriate  settings  of  the  input  network  can  be 
decided  without  knowledge  of  the  output  network.  The  settings  of  output 
network switches are then decided after those of the input network.
(Separability  with  respect  to  the  output  network  can  be  similarly 
defined,  although  no  advantage  seems  to  be  found  in  such  SNPs.) 

The  following  theorem  gives  necessary  and  sufficient  conditions  on  the 
SI  and  SO  matrices  for  the  existence  of  such  a  separable  network. 

THEOREM 2: An SNP, composed of single-level input and output networks,
is separable with respect to the input network if and only if (iff) the
corresponding SI and SO matrices satisfy the following properties:

(a) The union of all sets of i rows, i = 1, 2, ..., z, of SI contains at
least id+t ones.

(b) The union of all sets of j rows, j = 1, 2, ..., d, of SO overlaps
each row of SI in at least j+t places, and the symmetric difference of
each row of SI with the union of all sets of j rows, j = 1, 2, ..., d,
of SO does not have more than d-j ones.

We now develop a few general procedures for synthesizing single-level
separable  and  nonseparable  SNPs,  as  well  as  algorithms  for  establishing 
the  switch  settings. 




A3.3.3. SEPARABLE SNP SYNTHESIS


One procedure for synthesizing an SNP that is separable with respect to
the input network is illustrated by means of the example of Figure A3.4,
with parameters z = 6, d = 4, s = t = 3. The general form of the input
network for the case s = t is as follows. The first row contains
switches in the first d+s positions. The second and all succeeding rows
also contain d+s switches with an overlap of s switches with the
preceding row. Thus a given row is merely the preceding row shifted d
places. It is seen that the total number of input switches is z(d+s).
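
The shifted-row structure of the input network can be written down directly. The sketch below is illustrative only: the function name, the 0-based column numbering, and the assumption that the matrix has m = zd+s columns (one per chip) are choices made here for the example.

```python
def build_input_matrix(z, d, s):
    """Build the z-by-m matrix SI of the separable input network.

    Row r (0-based) holds switches in the d+s consecutive columns that
    start at column r*d, so each row is the preceding row shifted d
    places and adjacent rows overlap in s columns.
    Assumes m = z*d + s columns, one per memory chip.
    """
    m = z * d + s
    SI = [[0] * m for _ in range(z)]
    for row in range(z):
        for col in range(row * d, row * d + d + s):
            SI[row][col] = 1
    return SI
```

For the parameters of the example (z = 6, d = 4, s = 3) this gives a 6 by 27 matrix with 7 switches per row, i.e. z(d+s) = 42 input switches in total.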

Fig. A3.4 An example of a separable SNP: z = 6, d = 4, s = t = 3.

This input network bears some resemblance to Stiffler's "rippler" (see
Stiffler 73). The rippler's function is to transfer data from (say) a
d-byte register to an arithmetic unit containing d+R byte slices. The
transfer is such that the order of the bytes is preserved while avoiding
faulty byte slices. A reasonable form of the rippler network is the
input network of Figure A3.4, where the number of switches per row is
R+1, and the row overlap is R.

An extremely simple algorithm suffices for deciding which switches are
to be set for the input network.

ALGORITHM  1.  For  each  row  in  turn  the  d  leftmost  switches  are 
considered.  For  each  of  these  which  corresponds  to  the  position  of  a 
failed  chip,  this  switch  is  skipped  and  the  next  to  the  right 
considered. 
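
Algorithm 1 can be sketched as a left-to-right scan of each row. This is one possible reading, not the authors' implementation: it assumes the shifted-row input network described above (row r covers columns r*d through r*d+d+s-1, numbered from 0), and it additionally assumes that a column already claimed by an earlier row is skipped in the same way as a failed chip. The function name and the error handling are hypothetical.

```python
def set_input_switches(z, d, s, failed):
    """Choose d working columns for each input row, scanning left to right.

    failed is a collection of 0-based column indices of failed chips.
    Returns a list of z lists, each holding the d columns whose switches
    are closed in that row. Raises ValueError if a row cannot be served.
    """
    failed = set(failed)
    used = set()          # columns already claimed by earlier rows (assumption)
    settings = []
    for row in range(z):
        chosen = []
        for col in range(row * d, row * d + d + s):
            if col in failed or col in used:
                continue  # skip this switch and consider the next to the right
            chosen.append(col)
            if len(chosen) == d:
                break
        if len(chosen) < d:
            raise ValueError(f"row {row} cannot be served with these failures")
        used.update(chosen)
        settings.append(chosen)
    return settings
```

With z = 6, d = 4, s = 3 and failed chips in columns 0, 5 and 26, the first row is served by columns 1 to 4 and the last row by columns 22 to 25. Four failures clustered in columns 0 to 3 exceed t = 3 and leave the first row unserved.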

Now let us consider the output network as illustrated in Figure A3.4.
The first d columns consist of a diagonal line of switches followed by a
solid block of switches in columns d+1, d+2, ..., d+s. Thereafter the
network consists of alternating diagonal lines of switches and
"inverted" diagonals, with solid blocks of switches superposed on top of
s consecutive columns every 2d columns. (The alternation of identity
arrangements with the inverted identity arrangements provides a nearly
balanced load on each row.) The number of switches in each row of the
output network is bounded from above by z + s⌈z/2⌉, yielding a total
number of switches zs + 2zd + sd⌈z/2⌉, including the input network;
"⌈x⌉" denotes the smallest integer containing x.
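
The switch-count bound can be evaluated for concrete parameters. The short sketch below simply restates the arithmetic of the preceding paragraph; the function name and the returned tuple layout are illustrative assumptions.

```python
def separable_switch_bound(z, d, s):
    """Return (input switches, output upper bound, total upper bound).

    Input network: z rows of d+s switches.
    Output network: d rows, each with at most z + s*ceil(z/2) switches.
    """
    half = -(-z // 2)                      # ceil(z/2) in integer arithmetic
    input_switches = z * (d + s)
    output_bound = d * (z + s * half)
    return input_switches, output_bound, input_switches + output_bound
```

For z = 6, d = 4, s = 3 this gives 42 input switches, at most 60 output switches, and a total of at most 102, which agrees with zs + 2zd + sd⌈z/2⌉ = 18 + 48 + 36.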

An  algorithm  for  deciding  which  switches  are  to  be  set  in  response  to  a 
pattern  set  by  the  input  network  is  quite  simple. 

ALGORITHM  2.  Consider  the  first  d  columns  activated  by  the  setting  up 
of  the  input  network.  Switches  are  to  be  set  in  the  output  network  so 
as  to  connect  each  of  these  d  columns  to  a  unique  output  row.  First  set 
the  switches  in  the  identity  section  to  handle  any  of  the  activated 
columns.  Those  rows  not  yet  served  will  be  handled  by  setting 
appropriate  switches  in  the  solid  block  section.  Then  the  second  group 
of d columns is handled, and so on, until all groups are accommodated.

A3.3.4. NONSEPARABLE SNP SYNTHESIS


The primary advantage of separable SNPs is the simplicity of algorithms
for deciding switch settings. One would expect that a price for such
simplicity would be an increase in the number of required switches, but
we have not yet found a nonseparable SNP that is more economical than
the separable network construction of Figure A3.4. However, one
disadvantage of the SNP of Figure A3.4 is the excessive switch loading
on some of the columns of the output network. This is a particularly
severe problem if the switches are part of the memory chips. We have
attempted to find separable SNPs wherein all columns in the output
network contain an equal number of switches. Such networks can be
found, but they are costly. This has led us to pursue the synthesis of
nonseparable SNPs.

The nonseparable SNP displayed in Figure A3.5, for the parameters z = 6,
d = 4, s = 6, t = 5, alleviates this difficulty by providing a nearly
constant loading on all columns of the input and output networks. In
this structure there is effectively one spare chip per control line
(s = z). The guaranteed correction capability is t = 5, independent of
the other parameters. Each row of the input network contains 3(d+1)
switches with an overlap of 2(d+1) switches between adjacent rows. The
rows of the output network are simple cyclic shifts of a repetitive
pattern, consisting of two switches followed by d-1 places with no
switches.


The total switch count for this SNP is approximately 5z(d+1), or about
twice that of the separable SNP of Figure A3.4 with t = 5. The structure
of Figure A3.5 can be generalized to one containing t(zd+s) switches,
which tolerates all patterns of t or fewer chip failures.


Fig. A3.5 An example of a nonseparable SNP for z = 6, d = 4, s = 6, t = 5 (6).
(Legend: one symbol marks a switch in the original network; a second
symbol marks an augmenting switch that gives a correction capability of
t = 6.)


It is possible to increase the correction capability of this
nonseparable SNP to t = 6 by augmenting the output network with the extra
switches indicated by the augmenting-switch symbol in Figure A3.5. This
augmentation places an extra switch in all columns of the output network
that previously contained only one switch. It is observed from Theorem 1
that each row of the output network must overlap each row of the input
network in 7 places for a fault correction capability of t = 6. Since the
switches of each input row span three groups of output switches and
since each such output switch group contains two columns of one switch,
the augmentation technique yields the overlap of 7 only if d < 6.

This  latter  SNP  is  of  interest  from  two  viewpoints.  First,  the  input 
and  output  switch  loading  is  constant  on  all  memory  chips,  i.e.,  the  SNP 
is  CHIP  REGULAR.  This  regularity  (or  near  regularity  in  the  case  of  the 
nonaugmented  version)  permits  the  simple  embedding  of  the  input  and 
output  networks  within  the  memory  chips.  Second,  although  there  are 
some  patterns  of  t+1,  t+2,  ...  chip  failures  that  are  not  correctable, 
the  number  of  such  offensive  patterns  for  large  values  of  z  is  small. 

In Sections A3.3.6 and A3.3.7 we discuss a realization with embedded
switches and an analysis of the correction capability of the SNP beyond
the guaranteed limit.

A3.3.5. MULTI-LEVEL NETWORKS

We thus see that there are SNPs that handle all combinations of t chip
failures at a cost of approximately kzdt switches, where k is a constant
between 0.5 and 1. This is certainly a tolerable cost for a relatively
small number of chip failures, e.g., up to 8. However, if a large
memory employing such reconfiguration techniques is to function
unattended for a mission of a year or more, it might be necessary to
handle 20 or more chip failures. In this case the switch cost can
become a significant fraction of the total memory cost. As discussed
below, the switch cost can in this case be reduced by replacing a
single-level network by a multi-level network. The discussion below is
brief, since a previous paper (Goldberg et al. 68) pursues the
multi-level case in great detail, although for a different
application.


Fig. A3.6 A multi-level SNP. (Figure labels: control lines, data lines.)


Figure A3.6 illustrates the basic form of the SNP. The input network
illustrated in Figure A3.4 requires only z(d+t) switches, a cost that is
not excessive for all reasonable values of t. Thus the SNP of Figure
A3.6 is assumed to have this same input network. However, the costly
output network of Figure A3.4 can be avoided. Recall that the setting
of the switches in the first row of the input network activates d
columns among the first d+t columns. It is the role of the output
network switches, corresponding to this set of d+t columns, to funnel
the activated columns into the set of d output lines. Similarly, the
second row of the input network activates d columns in the set
d+1, d+2, ..., 2d+t, and so on for the remaining z-2 input rows. Hence,
the output network function can be realized by a set of z
order-preserving (OP) networks, each of which performs the funneling
operation as described above. (Actually, for this memory application,
the networks need not be order preserving, since the order of memory
chips within a block of memory is not critical. However, if we require
an efficient network, we have always been able to find an OP network as
efficient as a comparable non-OP network.) The first OP network has as
input columns 1, 2, ..., d+t, the second d+1, d+2, ..., 2d+t, the third
2d+1, 2d+2, ..., 3d+t, and so on. Each OP network yields d bytes. The
ith bytes from each of these z networks are ORed together bytewise
(e.g., by wired ORs) to form d bytes at the output.

In Goldberg et al. (68), a procedure is given for synthesizing such an
OP network as an interconnection of two-input, two-output, two-state
primitive cells as shown in Figure A3.7. Depending on the state of the
cell, the inputs are interchanged or merely directed through the cell.

We have described a recursive procedure for developing the network, as
illustrated in Figure A3.7. At the input, ⌈(d+t-2)/2⌉ cells and at the
output ⌈(d-2)/2⌉ cells flank two smaller networks. The upper network is
an OP network of ⌈(d+t)/2⌉ inputs and ⌈d/2⌉ outputs, while the lower is
an OP network of ⌊(d+t)/2⌋ inputs and ⌊d/2⌋ outputs, where "⌊x⌋" is the
largest integer contained in x. Each of these networks is replaced by a
similar three-layer construction, and so on. Eventually, there is a
degenerate requirement for an OP network of p inputs and 1 output. Such
a network is easily realized as a simple linear array of p-1 cells.
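
The recursive decomposition lends itself to a direct cell count. The sketch below only counts cells according to the layer sizes quoted above; it does not build or route the network, and the function names are hypothetical. It assumes at least one output and at least as many inputs as outputs.

```python
def ceil_half(x):
    """ceil(x / 2) for a non-negative integer x."""
    return (x + 1) // 2


def op_cells(n_in, n_out):
    """Count primitive cells in an OP network with n_in inputs and n_out outputs.

    Here n_in plays the role of d+t and n_out the role of d.
    Base case: p inputs and 1 output need a linear array of p-1 cells.
    Otherwise an input layer and an output layer flank an upper and a
    lower sub-network of roughly half the size.
    """
    if n_out < 1 or n_in < n_out:
        raise ValueError("need n_in >= n_out >= 1")
    if n_out == 1:
        return n_in - 1
    input_layer = ceil_half(n_in - 2)
    output_layer = ceil_half(n_out - 2)
    upper = op_cells(ceil_half(n_in), ceil_half(n_out))
    lower = op_cells(n_in // 2, n_out // 2)
    return input_layer + output_layer + upper + lower
```

For example, with d = 4 and t = 3 (7 inputs, 4 outputs) the count is 3 input-layer cells, 1 output-layer cell, 3 cells for the upper (4 in, 2 out) network and 2 cells for the lower (3 in, 2 out) network, 9 cells in all.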


Fig. A3.7 Decomposition of the order-preserving network.

The number of cells required for this OP network is C(d+t) log₂ t, where C
is approximately one-half. Thus, the number of cells in the output
network (which still dominates the input network) is Cz(d+t) log₂ t, which
represents a saving of t/log₂ t, compared with the single-level cross-bar
realization. For t > 8, the multilevel version becomes more economical.
Techniques for setting up the OP network and an approach to
incorporating fault tolerance within it are discussed in Goldberg et al.
(68).

In  conclusion  we  have  determined  that  the  switch  cost  for  reconfiguring 
the  chips  of  a  memory  is  small  when  compared  with  the  total  memory  cost. 
In  addition,  we  have  shown  that  the  algorithms  for  deciding  which 
switches  are  to  be  set  can  be  simple  in  certain  cases. 

A3. 3.6.  .«  NONSEPARAhLE  NETWORK  WITh  EliUEDDEL)  SWITCHES 

As mentioned previously the switches in the nonseparable SNP of Figure
A3.5 can be embedded within the memory chips. In the augmented version
of Figure A3.5, a given chip can, by virtue of the input switching
network, receive an activation signal from one of three control lines,
or be disconnected from all control lines. Similarly, by virtue of the
output switching network, the chip can be switched onto one of two data
lines, or be disconnected from all data lines. The process of embedding
the switching within the chips can be seen by reference to Figure A3.8.
Each chip has as inputs three control lines, and as outputs two data
lines. An activation select switch makes the connection to one of three
control lines, or to a fourth vacuous input. Similarly a data-line
select switch makes the connection to one of two data lines or to the
vacuous output.

Figure A3.9 shows the connections of the array of chips to the control
and data lines for the same parameters as the SNP of Figure A3.5,
namely, z = 6, d = 4, s = 6, t = 5 or 6. When the dotted line connections
to the data lines are present, t = 6 faults can be handled; otherwise,
t = 5.


Fig. A3.8 Memory chip with components of input and output switches

Fig. A3.9 Organization of nonseparable SNP with embedded switches


It is convenient to view the last column of chips as spares; i.e.,
with all chips operative, this last column of chips remains
disconnected. As failures occur, the spare chips are brought into
service. We have developed an algorithm that determines the appropriate
switch settings for any correctable fault pattern. The algorithm is
more complicated than the algorithm for the separable case, and may
require a substantial reorganization of the memory blocks subsequent to
a failure, including operative blocks. The span of the reorganization
can be shown (Goldberg et al. 68) to be related to the clustering of the
chip failures in the array. That is, if the failures are spread out
over the array, relatively little reorganization is required.

A3.3.7. ANALYSIS OF CORRECTION CAPABILITY IN REGULAR SNPs

The organization of the type depicted in Figures A3.5 and A3.9 exhibits
more spares than the guaranteed fault-correction capability; however,
in these organizations a large fraction of the fault patterns containing
f failures, t+1 ≤ f ≤ s, are indeed correctable. In this section we
present some approximate upper and lower bounds on the fraction of such
faults that are correctable for the case of what we define below to be
I/O regular SNPs. The derivation of these bounds is given below.

We define the function c(f) to be the fraction of patterns containing f
faulty chips that cannot be accommodated by reconfiguration. In a
memory organization with z rows each of d bytes, there are p = zd unique
paths that must be established between the control lines and the data
lines. The input and output switching networks must be set so that each
path contains a (unique) nonfaulty chip.

In estimating c(f) we define a ROUTE to be the set of paths between a
given control line and a given data line, and make the assumption that
each route to be served contains e paths, and that e is a constant for
each route. This is the definition of an I/O REGULAR SNP. The
nonaugmented SNP of Figure A3.5 is I/O regular with e = 6, but not chip
regular. The opposite is true for the augmented version. (In particular
some routes in the augmented version contain seven paths while others
contain eight.) Thus, the bounds derived below are only exact for the
nonaugmented SNP. However, they represent lower bounds for the
augmented case, provided the lower applicable value of e is used.


Clearly, we have the following special cases: c(f) = 0 for f < e, for
clearly no route is deprived of all of its paths. On the other hand, if
we consider the case where all spares are used, then c(f) = 1 for f > s.
The development of estimates (or bounds) for c(f) reduces to the cases
between these two extremes.


A particular fault pattern of f faulty modules will not be tolerated if
and only if, for all i < f, it contains a sub-pattern such that all but
(i-1) or fewer modules included in i routes are in the sub-pattern.

If we denote by c_i the probability of the i-th term above, then

$$1 - c(f) = (1 - c_1)(1 - c_2)\cdots(1 - c_i)\cdots$$

For small values of c_i, a sufficiently close approximation is

$$c(f) \approx c_1 + c_2 + \cdots \tag{A13}$$
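
The step from the product to the sum can be made explicit by expanding the product. The expansion below is a standard identity added here for illustration; the numerical values in the last line are hypothetical and serve only to show the size of the neglected term.

```latex
\begin{align*}
c(f) &= 1 - \prod_i (1 - c_i)
      = \sum_i c_i \;-\; \sum_{i<j} c_i c_j \;+\; \cdots \\
% Hypothetical values, for illustration only:
c_1 = 0.01,\ c_2 = 0.001:\quad
c(f) &= 1 - (0.99)(0.999) = 0.01099 \approx c_1 + c_2 = 0.011
\end{align*}
```

The neglected terms are products of two or more of the c_i, so for small c_i the plain sum slightly overestimates c(f).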

We introduce the concept of 'overlap' X_ij defined by

$$X_{ij} = \text{number of modules that serve routes } i \text{ and } j \text{ in common,} \tag{A14}$$

and also

$$X = \max_{i,j}\,(X_{ij}) \tag{A15}$$

The value of c_1 can be computed for regular structures (i.e., those for
which e is a constant for all routes).

Given a pattern of f faults the number of sub-patterns of size e is
$\binom{f}{e}$. There exist just p patterns of e faults that will not be
tolerated out of a total number of patterns of e faults of
$\binom{m}{e}$; therefore:

$$c_1 = \frac{p\,f!\,(m-e)!}{(f-e)!\,m!} \tag{A16}$$

For m ≫ e, we have the approximation

$$c_1 \approx \frac{p\,f!}{(f-e)!\,m^e} \tag{A17}$$

Since c_2, c_3, ... are non-negative, (A17) is an approximate lower
bound of c(f).


In obtaining an expression for c_2, we need to consider the size of a
pattern that will not be tolerated because only one module remains of
those that are included in two routes. Given two routes i and j, the
minimum pattern to disable one of them because of commonality of
modules to them is L_ij, where

$$L_{ij} = 2e - X_{ij} - 1 \tag{A18}$$

For the case where no overlap exists for routes i and j, i.e. X_ij = 0,
the disabling is of the type considered under the derivation of c_1
above. We define the parameter L by

$$L = \min_{i,j}\,(L_{ij}) = 2e - X - 1 \tag{A19}$$

Given a pattern of f faults, there exist $\binom{f}{L_{ij}}$
sub-patterns of size L_ij.

We consider each pair of routes i and j. For each such pair the fault
pattern will be tolerated if and only if it does not contain a
sub-pattern of size L_ij included in the set of modules of number L_ij+1
that serve the two routes i and j. For these (L_ij+1) modules, there are
(L_ij+1) ways of selecting a fatal sub-pattern. Within the whole
structure there are $\binom{m}{L_{ij}}$ sub-patterns of size L_ij, and
$\binom{f}{L_{ij}}$ are included in the fault pattern being considered.
The pair of routes will survive with a probability Q_ij, where

$$Q_{ij} \approx 1 - \frac{(L_{ij}+1)\binom{f}{L_{ij}}}{\binom{m}{L_{ij}}} \tag{A20}$$

the approximation being valid if L ≪ m. The probability Q of all pairs
surviving is given by
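
$$Q = \prod_{i}\prod_{j>i} Q_{ij} \tag{A21}$$

whence

$$c_2 = 1 - Q \approx \frac{1}{2}\sum_{i}\sum_{j \ne i} \frac{(L_{ij}+1)\,f!\,(m-L_{ij})!}{(f-L_{ij})!\,m!} \tag{A22}$$

Each term in the double series can be expanded in the form

$$(L_{ij}+1)\left(\frac{f}{m}\right)\left(\frac{f-1}{m-1}\right)\cdots\left(\frac{f-L_{ij}+1}{m-L_{ij}+1}\right) \tag{A23}$$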


As L ≤ L_ij, the replacement of L_ij by L in (A23) will result in fewer
terms in the product. We can therefore derive an upper bound for c_2 as

$$c_2 \le \frac{1}{2}\sum_{i}\sum_{j \ne i} \frac{(L+1)\,f!\,(m-L)!}{(f-L)!\,m!} \tag{A24}$$

which for L ≪ m yields the approximation

$$c_2 \approx \frac{p(p-1)(L+1)\,f!\,(m-L)!}{2\,(f-L)!\,m!} \tag{A25}$$


Consider now the expression for c(f), the probability of non-coverage,
i.e. from (A13)

$$c(f) = c_1 + c_2 + \cdots \tag{A26}$$

On intuitive grounds, we say that this series is strongly converging,
for if it were not so, the implication would be that a fault pattern
would be more probable to be not covered because of interaction among
(i+1) routes than among i routes. For values of i small compared to
m, this implies that in going from i to (i+1) there is a greater
probability of the new fault being strongly connected than being
disjoint, which for small i is absurd. We therefore consider only the
first two terms of the series to obtain

$$c(f) \approx c_1 + c_2 \tag{A27}$$

or

$$c_1 < c(f) < c_1 + c_2 \tag{A28}$$

Note: in computing values of c_1 and c_2, the c_2 is, for reasonable
cases, significantly smaller than c_1, lending credence to the intuitive
argument above. The bounds on c(f) are therefore:


$$\frac{p\,f!}{(f-e)!\,m^e} \;<\; c(f) \;<\; \frac{p\,f!}{(f-e)!\,m^e} + \frac{p^2\,(L+1)\,f!}{2\,m^L\,(f-L)!} \tag{A29}$$
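
The bounds can be evaluated numerically. The sketch below is an illustration under stated assumptions rather than the authors' computation: it uses the exact binomial ratios implied by the counting argument (p fatal patterns among all patterns of size e, and the pairwise term with p(p-1)/2 route pairs) instead of the large-m approximations, and the function names are hypothetical. The overlap X must be supplied by the caller; no value for it is given in the text.

```python
from math import comb


def c1_lower_bound(p, m, e, f):
    """p * C(f, e) / C(m, e): chance that f faults wipe out some single route."""
    return p * comb(f, e) / comb(m, e)


def c2_upper_bound(p, m, L, f):
    """Pairwise term: p(p-1)/2 route pairs, each with (L+1) fatal sub-patterns."""
    return p * (p - 1) * (L + 1) * comb(f, L) / (2 * comb(m, L))


def noncoverage_bounds(p, m, e, X, f):
    """Return (lower, upper) bounds on c(f), with L = 2e - X - 1."""
    L = 2 * e - X - 1
    lower = c1_lower_bound(p, m, e, f)
    return lower, lower + c2_upper_bound(p, m, L, f)
```

For the parameters of Figure A3.5 (p = zd = 24, m = zd+s = 30, e = 6) the lower bound is zero for f < 6 and equals 24/593775 at f = 6, since math.comb returns 0 when the sub-pattern is larger than the fault pattern.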


A3.3.8. REGULAR SEPARABLE SWITCHING NETWORKS


We  consider  the  design  of  input  (SI)  and  output  (SO)  switching  networks 
which  are  separable  and  are  also  uniform  in  that  the  fan-in  and/or 
fan-out  of  each  unit  of  each  part  of  the  system  (decoder,  chip,  etc.)  is 
the  same. 

Define:

b   = number of inputs to SI
d   = number of outputs from SO
s   = number of spare chips
t   = number of faulty chips to be tolerated = s
m_i = number of cells in each row of SI
m_o = number of cells in each row of SO
k_i = number of cells in each column of SI
k_o = number of cells in each column of SO
m   = number of chips total = bd+s

Separable networks have the desirable property that a simple algorithm
is known for setting the switches in the presence of arbitrary fault
patterns. Most separable networks known to date have the disadvantage
that the loading on the parts of the system is nonuniform. We develop a
set of necessary conditions for a network to be regular separable.
Using these conditions a number of potentially regular separable
networks (RSN) have been found, some of which are indeed RSN. No cases
have been found of a network satisfying all the conditions and not being
RSN. We conjecture that all the cases are RSN.

CONDITION 1: REGULARITY OF SI

The total number of cells in SI is bm_i. The total number of modules is
m = bd+s. Clearly,

$$\frac{b\,m_i}{bd+s} = k_i \quad \text{(an integer)} \tag{A30}$$

By Theorem 1 of Section 4.2, it follows that m_i = d+s:

$$\frac{b(d+s)}{bd+s} = k_i \tag{A31}$$

or

$$s = \frac{(k_i - 1)\,bd}{b - k_i} \tag{A32}$$


CONDITION 2: REGULARITY OF SO

The total number of cells in SO is m_o d. Clearly,

$$\frac{m_o\,d}{bd+s} = k_o, \quad \text{where } 1 \le k_o \le d \text{ and } k_o \text{ is an integer} \tag{A33}$$
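
Conditions 1 and 2 are simple divisibility tests and can be enumerated mechanically. The sketch below is a hedged illustration, not the program used in the study: the function names are hypothetical, and it assumes m_i = d+s so that the number of spares follows from the column loading k_i of SI.

```python
def spares_for_regular_si(b, d, k_i):
    """Return s = (k_i - 1) * b * d / (b - k_i) if it is an integer, else None.

    Requires 1 <= k_i < b so that the denominator is positive.
    """
    if not 1 <= k_i < b:
        return None
    num = (k_i - 1) * b * d
    den = b - k_i
    return num // den if num % den == 0 else None


def regular_so_row_sizes(b, d, s):
    """List (k_o, m_o) pairs with m_o * d = k_o * (b*d + s) and 1 <= k_o <= d."""
    m = b * d + s
    return [(k_o, k_o * m // d) for k_o in range(1, d + 1) if (k_o * m) % d == 0]
```

For b = 6, d = 4 and k_i = 2 this gives s = 6, hence m = 30 chips, and the only row sizes of SO compatible with Condition 2 are m_o = 15 (k_o = 2) and the totally full case m_o = 30 (k_o = 4).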


CONDITION 3: SEPARABILITY

We restate Theorem 2 of Section 4.2 on necessary and sufficient
conditions on SO.

"Assume a valid SI, then the combination of SI and SO is separable if
and only if the union of every set of j rows of SO (j = 1, ..., d)
overlaps each row of SI in at least s+j places."

Consider any row of SO and apply a test for the case j = 1. The overlap
with the first row of SI must be at least s+1. We must therefore
allocate at least s+1 cells to those columns where the first row of SI
has cells. Call this allocation A_1. In making the allocation there are
k_i A_1 cells in the allocated columns of SI. Consider another row of SI,
choosing that one that has a minimum number of cells in the columns
already allocated. There are (k_i - 1)A_1 available cells in the allocated
columns to be shared over the remaining b-1 rows. Therefore, there
exists at least one row of SI which contains at most ⌊(k_i - 1)A_1/(b-1)⌋
cells in the allocated columns. Choose this row. We must allocate more
cells of the row of SO to this row of SI. Specifically we must allocate
at least (s+1) - ⌊(k_i - 1)A_1/(b-1)⌋. The total allocation for the
two rows is therefore

$$A_2 = A_1 + (s+1) - \left\lfloor \frac{(k_i - 1)\,A_1}{b-1} \right\rfloor \tag{A34}$$
Applying the above reasoning to successive rows of SI we can develop the
general form

$$A_1 = s + 1$$

$$A_i = A_{i-1} + (s+1) - \left\lfloor \frac{A_{i-1}\,k_i - (s+1)(i-1)}{b-i+1} \right\rfloor, \quad i = 2, \ldots, b \tag{A35}$$

The necessary condition on $m_0$ becomes $m_0 \ge A_b$. Note that
Condition 3 is necessary, but to prove sufficiency using Theorem 2 of
Section 4.2, it is required to consider every set of j rows of SO. It
is, however, conjectured that for regular networks, Condition 3 may be
sufficient.

No cases that satisfy Condition 3 but are not regular separable have
been found.

To illustrate the test consider the case b = 6, d = 4, s = 6,
$k_1 = 2$, $m_1 = 10$. Then $A_1 = 7$, $A_2 = 13$, $A_3 = 17$,
$A_4 = 20$, $A_5 = 21$, $A_6 = 21$, whence $m_0 \ge 21$.
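The allocation test can be checked mechanically. The following is a
minimal Python sketch, assuming the recurrence (A35) reads
A_1 = s+1 and A_i = A_(i-1) + (s+1) - floor((A_(i-1) k_1 - (s+1)(i-1)) / (b-i+1));
that reading and the function name `allocation_bounds` are assumptions
made here, and this is not the program referred to in the text.

```python
def allocation_bounds(b, s, k1):
    """Return [A_1, ..., A_b] for the Condition 3 test (assumed form of A35)."""
    bounds = [s + 1]                                   # A_1 = s + 1
    for i in range(2, b + 1):
        prev = bounds[-1]
        # cells already available to the least-covered remaining row of SI
        shared = (prev * k1 - (s + 1) * (i - 1)) // (b - i + 1)
        bounds.append(prev + (s + 1) - shared)
    return bounds


# Worked example from the text: b = 6, s = 6, k_1 = 2
print(allocation_bounds(6, 6, 2))   # [7, 13, 17, 20, 21, 21]
```

The last entry is the lower bound on m_0, which agrees with the value
21 quoted in the example.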

The scheme used to find RSNs was programmed and the results of a small
run are shown in Tables A1 and A2. Note that the solution $m_0 = bd+s$,
which is a totally full SO, is trivial and is not shown.


We conclude that Regular Separable Networks exist, but such networks
contain a very high proportion of cells in the switch, leading to high
fan-out and fan-in. However, such networks can be designed to enable
reconfiguration in the presence of a large number of faults.
Table A1  Potential RSNs for


maximum  of  three  values  of  m  are  shown 
ABSTRACT

This paper considers codes with radix r > 2 which are capable of
correcting arbitrary arithmetic errors in any radix r digit. If each
radix r digit represents a byte of b binary digits (e.g., $r = 2^b$),
these codes correct any combination of errors occurring in the b binary
digits of any single byte. A theoretical basis for these codes is
presented, along with practical considerations regarding their
applicability.

I  INTRODUCTION 

This paper is concerned with error detection, error correction and
error location for multiple errors within a particular byte of an
arithmetic unit, and is motivated by several observations. First, it is
possible to obtain byte error-correcting arithmetic codes with low code
redundancy. Second, it is possible to provide high system availability
and relatively maintenance-free operation through autonomous
replacement with spares. For certain applications it is desirable to
replace not an entire processor or arithmetic unit, but rather one of
several identical sub-units. Thus byte-slicing is attractive. Third,
byte-slicing is also naturally allied with fast-carry logic, e.g.,
carry look-ahead over bytes (and even within bytes). Fourth, LSI
technology is suitable for realization of a byte of logic on a chip.
Fifth, LSI technologies often give rise to multiple errors on a chip
resulting from a single fault. Thus higher radix (byte) arithmetic
coding may be highly effective: with chips corresponding to bytes,
multiple errors in arithmetic within a byte may then be economically
corrected. Besides, single-bit error-correcting codes are inadequate
for the multiple errors which may arise from fast-carry logic. Location
of the faulty byte-slice and autonomous replacement with spares is also
facilitated by the byte coding.

In this paper previous results of Peterson and of Rao and Trehan for
perfect single-error-correcting arithmetic codes are generalized to
higher-radix number systems. A single arithmetic error in a radix r
representation is of the form $\pm a r^j$, 0 < a < r. It is shown here
that all such errors are correctable by an AN code with generator A of
the form (r-1)p, where p is a prime greater than r satisfying certain
specified conditions. These results also apply directly to
corresponding systematic codes (e.g., bi-residue codes with residues
r-1 and p, and gAN codes). Further results are also given for other
interesting (but non-perfect) codes.

The results of this paper are potentially suitable for use in a
byte-organized processor, e.g., using one chip for each b-bit byte
(representing a radix r digit) of the processor, where $r = 2^b$,
r = 10, etc. Thus it is possible to correct any combination of bit
errors resulting from errors in any single byte position, that is, any
arbitrary single-digit arithmetic error in the higher radix r. As a
consequence, certain known bit correcting codes are seen to be byte
correcting as well. Examples are included here, along with a discussion
of the applicability of such codes in fault-tolerant computing systems.
To determine the set of all possible errors capable of arising from
various faults, a careful and thorough analysis is required, such as
the one conducted by Langdon and Tang [12] for adders employing carry
look-ahead between and within bytes. Their analysis establishes that
the errors in carry look-ahead adders resulting from single faults are
frequently not of the form $\pm 2^j$. Therefore the binary
single-error-correcting codes are not effective in such cases,
especially in byte-per-chip realizations. Here we assume that the byte
adders can be designed in such a way that the carry-out (look-ahead)
logic circuit is independent of the rest of the logic (namely, the
internal carry generation, sum-byte logic, etc.). Consequently, we
allow any error combination in the sum byte or in the carry-out but not
in both (unless that combination is equivalent to an error in one or in
the other). Specifically, the byte-correcting codes discussed here are
capable of correcting any additive error involving a single digit
(byte) of radix r, of the form $\alpha r^j$, where $\alpha$ is a
positive or negative additive error of magnitude $a = |\alpha|$,
0 < a < r, and where j is the position of the radix r byte processor in
error, $0 \le j < n$. Such errors are characterized as single
arithmetic errors in radix r by Peterson [17] and have arithmetic
(Peterson) weight one. More precisely, in adders using radix-complement
(or diminished-radix-complement) arithmetic, a single byte (arithmetic)
error E is defined as an error of modular weight one [21], and is given
by

$$E = \alpha r^j \quad \text{or} \quad m - \alpha r^j,$$

where $0 < |\alpha| < r$, $0 \le j < n$, and $m = r^n$ ($m = r^n - 1$
for the diminished radix complement case).

II BYTE-ERROR DETECTION

Error detection techniques are well known using "AN" codes (which are
nonsystematic) [4,5,17] and "(N, $|N|_A$)" residue codes (which are
systematic and separate) [1,4,18,19]. Single-byte error detection
arises whenever the base A is an integer greater than r that is
relatively prime to r. Two suitable choices are r + 1 and $r^2 - 1$.
When $r = 2^b$, the check base r - 1 is also interesting, particularly
for the simplicity of its implementation; however, in this case not all
single byte errors are detected (e.g., an error changing 0 to r - 1
leaves the residue unchanged), although the most probable errors [1]
and a very high percentage of all single byte errors are detected. Such
a residue code is used (with b = 4) in the JPL STAR computer [3].
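The difference between these check bases can be seen by enumeration.
The sketch below is illustrative only: the function name
`undetected_errors` and the parameters r = 16, n = 4 are choices made
here, not values taken from the paper. It lists the single-byte errors
whose value is a multiple of the check base and which therefore leave
the residue unchanged.

```python
def undetected_errors(r, n, check_base):
    """Return the single-byte errors (alpha, j), 0 < |alpha| < r, 0 <= j < n,
    whose value alpha * r**j leaves the residue modulo check_base unchanged."""
    missed = []
    for j in range(n):                 # byte position
        for a in range(1, r):          # error magnitude
            for alpha in (a, -a):      # additive error of either sign
                if (alpha * r**j) % check_base == 0:
                    missed.append((alpha, j))
    return missed


r, n = 16, 4                                   # 4-bit bytes, four positions (illustrative)
print(undetected_errors(r, n, r + 1))          # []
print(undetected_errors(r, n, r * r - 1))      # []
print(undetected_errors(r, n, r - 1))          # only alpha = +/-(r - 1) is missed
```

With check base r - 1 the only undetected single-byte errors are those
of magnitude r - 1, which matches the example of a byte changing from 0
to r - 1.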
III NEAR-PERFECT BYTE-CORRECTING CODES

Single-error correction in binary adders is attainable with the
nonsystematic AN codes first studied by Brown [5] and Peterson [17],
and with the systematic multi-residue codes studied by Avizienis
[1,2,4], Rao [19,21] and Garcia. Systematic rearrangements of the AN
codes, namely, the systematic gAN codes, have been discussed by Garner
[9] and by Rao [20]. Extensions of the results of Brown and Peterson to
higher radices have been discussed previously by Rao and Trehan [22],
primarily for r = 3. Here we first characterize those optimal AN codes
in radix r which are capable of correcting arbitrary arithmetic errors
in a single (radix r) byte. These byte-correcting codes are obtained by
choosing A of the form (r-1)p, where p is a prime greater than r
satisfying certain specified conditions. (Theorems of Peterson and of
Rao and Trehan follow as special cases for r = 2 and 3.) Further
theorems are developed which aid in deriving suboptimal codes, and
examples are cited. These results are immediately applicable to
corresponding systematic codes with the same r and p: bi-residue codes
(N, $|N|_{r-1}$, $|N|_p$) in radix r with residues r - 1 and p, and gAN
codes with A = (r-1)p, both of which are therefore also byte
correcting. Interesting byte-correcting codes also exist for some
nonprimes p.


Theorem 2: Given that -r, but not r, is primitive in GF(p) and that
condition (1) is satisfied, then

$$M_r(A,3) = \frac{r^{(p-1)/2} - 1}{A}, \quad A = (r-1)p. \tag{3}$$

Theorems of Peterson [17] and of Rao and Trehan [22] follow by setting
r = 2 and r = 3, respectively, in Theorem 2, since condition (1) is
valid.


Corollary 3 (Peterson): If -2 but not 2 is primitive in GF(p), then

$$M_2(A,3) = \frac{2^{(p-1)/2} - 1}{A}, \quad A = p. \tag{4}$$

Corollary 4 (Rao and Trehan): If -3 but not 3 is primitive in GF(p),
then

$$M_3(A,3) = \frac{3^{(p-1)/2} - 1}{A}, \quad A = 2p. \tag{5}$$

The sequence of expressions (4) and (5) extends readily to r = 4.


Theorem 5: If -4 but not 4 is primitive in GF(p), then

$$M_4(A,3) = \frac{4^{(p-1)/2} - 1}{A}, \quad A = 3p. \tag{6}$$

For r > 4, however, the simplicity of (4), (5) and (6) no longer
exists. Condition (1) is no longer generally satisfiable, and we must
resort to Theorem 2. (When condition (1) is not satisfied, Theorem 7
below is useful.)


The reader is assumed to be exposed to the concepts of arithmetic
weight, arithmetic distance, and linear congruences used here; he may
wish to refer to Peterson [17] or to Massey and Garcia (II) for
background. Throughout Sections III and IV, p denotes a prime greater
than r. The following are observed throughout this paper: $G_r(p)$
denotes the cyclic (multiplicative) subgroup $\{r^j \pmod p\}$, and
$e_r(p)$ denotes its order; $e_r(p)$ is also called the order or
exponent of r in the field GF(p); "a" is a nonzero radix r digit,
$0 < a \le r - 1$, i.e., an element of the field; $a^{-1}$ is its
multiplicative inverse, with $a a^{-1} \equiv 1 \pmod p$. With
A = (r-1)p, $M_r(A,3)$ is the maximum number of code words in the radix
r byte-correcting AN code (with arithmetic distance 3). The error
syndrome of a given presumed word in an AN code is the modulo A residue
of that word, e.g., 0 if it is a correct code word AN, since every code
word has residue zero. Thus an error $\alpha r^j$, $0 < |\alpha| < r$,
has the syndrome $\alpha r^j \pmod A$. With this background the
following theorem is the basic theorem of this paper.
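As a small illustration of these definitions, the sketch below builds
the syndrome table for the radix 4 AN code with p = 7, A = (r-1)p = 21
and n = (p-1)/2 = 3 bytes, and uses it to correct a single-byte error.
It is a sketch only: the helper names `syndrome_table` and `correct`
are hypothetical, and table lookup is just one possible way of using
the syndrome.

```python
def syndrome_table(r, p, n):
    """Map each syndrome (mod A) to the single-byte error alpha * r**j causing it."""
    A = (r - 1) * p
    table = {}
    for j in range(n):                      # byte position
        for a in range(1, r):               # error magnitude, 0 < a < r
            for error in (a * r**j, -a * r**j):
                s = error % A
                if s == 0 or s in table:
                    raise ValueError("errors do not have distinct nonzero syndromes")
                table[s] = error
    return A, table


def correct(word, A, table):
    """Subtract the error named by the syndrome; a zero syndrome means no error."""
    s = word % A
    return word if s == 0 else word - table[s]


A, table = syndrome_table(4, 7, 3)                 # A = 21, three 2-bit bytes
codeword = A * 2                                   # the code word AN with N = 2
print(len(table))                                  # 18
print(correct(codeword + 2 * 4**1, A, table))      # 42
print(correct(codeword - 3 * 4**2, A, table))      # 42
```

The 18 syndromes are exactly the nonzero residues modulo 21 other than
the r - 2 = 2 multiples of p (7 and 14), so every usable syndrome is
assigned to one correctable error.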


Theorem 2 is thus a generalized form of the Peterson Theorem whenever
-r but not r is primitive. Its converse is also true. The full theorems
of Peterson and of Rao and Trehan also cover the case of r primitive
for r = 2 and 3, respectively, for which cases p-1 is in $G_r(p)$:

$$M_r(A,3) = \frac{r^{(p-1)/2} + 1}{A}, \quad r = 2, 3, \tag{7}$$

if and only if r is primitive in GF(p). Unfortunately, (7) does not
hold for any r > 3, since r-1 cannot divide $r^{(p-1)/2} + 1$ for any p
(this quantity is congruent to 2 modulo r-1). A counterpart of Theorem
1 exists in this case, however, as follows.


Theorem 6: Given that p-1 exists in $G_r(p)$, and that condition (1) is
satisfied, then

$$M_r(A,3) = \frac{r^{e_r(p)/2} + 1}{p} \quad \text{for r even,} \tag{8}$$

$$M_r(A,3) = \frac{r^{e_r(p)/2} + 1}{2p} \quad \text{for r odd.} \tag{9}$$


Theorem 1: For any prime p > r, given that p-1 does not exist in
$G_r(p)$ and that the condition

$$(a-r+1)\,a^{-1} \notin G_r(p) \quad \text{for all } a,\ 0 < a < r-1 \tag{1}$$

is satisfied, then

$$M_r(A,3) = \frac{r^{e_r(p)} - 1}{A}. \tag{2}$$
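The hypotheses of Theorem 1 are easy to explore numerically. The sketch
below assumes that condition (1) reads "(a-r+1)a^(-1) is not in G_r(p)
for all a with 0 < a < r-1" and that (2) gives
M_r(A,3) = (r^(e_r(p)) - 1)/A; the function names are hypothetical, p is
assumed to be a prime greater than r, and `pow(a, -1, p)` requires
Python 3.8 or later.

```python
def subgroup(r, p):
    """Elements of G_r(p) = {r**j mod p}, in order of generation (p prime, p > r)."""
    elems, x = [], 1
    while True:
        x = (x * r) % p
        elems.append(x)
        if x == 1:
            return elems


def condition1_violations(r, p):
    """Digits a, 0 < a < r-1, for which (a-r+1) * a**-1 (mod p) lies in G_r(p)."""
    G = set(subgroup(r, p))
    bad = []
    for a in range(1, r - 1):
        c = (a - r + 1) * pow(a, -1, p) % p
        if c in G:
            bad.append(a)
    return bad


def theorem1_count(r, p):
    """M_r(A,3) from (2) when the hypotheses of Theorem 1 hold, else None."""
    G = subgroup(r, p)
    if (p - 1) in G or condition1_violations(r, p):
        return None
    return (r ** len(G) - 1) // ((r - 1) * p)      # len(G) is e_r(p)


print(condition1_violations(8, 19))   # [2, 5]
print(theorem1_count(8, 19))          # None
print(theorem1_count(4, 7))           # 3
print(theorem1_count(5, 11))          # 71
```

For r = 8, p = 19 the digit a = 2 (and its complement r-1-a = 5)
violates condition (1), so Theorem 1 does not apply to that pair.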

Proof is found in the Appendix, along with proofs of other theorems.
(As an example, the reader might try r = 8, p = 19, a = 2.) Next we
consider the special case when -r (i.e., p-r) is primitive [i.e.,
$(-r)^i \pmod p$ generates all elements from 1 to p-1 for i from 0 to
p-2], but when r is not primitive in GF(p). We know from number theory
(e.g., [19]) that in this case $e_r(p) = (p-1)/2$, while
$(-r)^{(p-1)/2} \equiv -1 \pmod p$ and $r^{(p-1)/2} \equiv 1 \pmod p$.
Therefore (because of the nonprimitivity of r) -1 does not exist in
$G_r(p)$, satisfying the first part of the hypothesis of Theorem 1.
This provides us with the following useful result.


Theorem 2 specifies the existence (or nonexistence) of near-perfect
codes in which all possible nonzero syndromes (omitting r-2 multiples
of p) are used to correct the possible byte errors in each of the
n = (p-1)/2 bytes of the resulting radix r AN code. The codes covered
by (7) and by Theorem 2 (and its derivatives (4)-(6)) are the only
near-perfect byte-correcting AN codes. (Those for r = 2 are perfect.)
The hypotheses for Corollary 3 and Theorem 5 are true precisely when
p = 8i-1 and 4i-1 are both primes (cf. [23], Theorems 38 and 39). Such
codes therefore exist for r = 4 (as well as r = 2) when p = 7, 23, 47,
71, 79, .... As further examples, the shortest nontrivial near-perfect
codes for r = 5, 6, 7, 8, 9, and 10 have p = 11, 19, 31, 71, 59, and
31, respectively. The shortest nontrivial near-perfect code for r = 16
has p = 503. (Note that p must be at least 2r-1 for a code to be
perfect.)
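The primes quoted above can be reproduced by a brute-force search. The
sketch below assumes that the hypotheses of Theorem 2 are: -r is
primitive in GF(p), r is not primitive, and (a-r+1)a^(-1) is not in
G_r(p) for every a with 0 < a < r-1 (condition (1)). The function names
are hypothetical and the search is not claimed to be the method used to
prepare the paper.

```python
def order(g, p):
    """Multiplicative order of g modulo the prime p (g not divisible by p)."""
    g %= p
    x, k = g, 1
    while x != 1:
        x = (x * g) % p
        k += 1
    return k


def is_prime(n):
    return n > 1 and all(n % q for q in range(2, int(n ** 0.5) + 1))


def satisfies_theorem2(r, p):
    """True if -r but not r is primitive mod p and condition (1) holds."""
    e = order(r, p)
    if order(-r, p) != p - 1 or e == p - 1:
        return False
    G = {pow(r, j, p) for j in range(e)}             # the subgroup G_r(p)
    return all((a - r + 1) * pow(a, -1, p) % p not in G
               for a in range(1, r - 1))             # needs Python 3.8+


def near_perfect_primes(r, limit):
    return [p for p in range(r + 1, limit)
            if is_prime(p) and satisfies_theorem2(r, p)]


print(near_perfect_primes(4, 80))                               # [7, 23, 47, 71, 79]
print([near_perfect_primes(r, 100)[0] for r in range(5, 11)])   # [11, 19, 31, 71, 59, 31]
print(near_perfect_primes(16, 600)[0])                          # 503
```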
IV NON-PERFECT BYTE-CORRECTING CODES WITH PRIME p

For completeness, the following theorem is included as an extension of
Rao and Trehan's [22] Theorem 3. It is sometimes helpful in generating
efficient codes, particularly for r > 4.

Theorem 7: When condition (1) does not apply, let j be the smallest
possible positive integer such that $r^j \equiv c \pmod p$, where
$c \equiv (a-r+1)a^{-1} \pmod p$ for some a in 0 < a < r-1. That is,
$c \equiv -\bar{a}a^{-1}$, where $\bar{a} = r-1-a$. Then


Mr(A,3) - X -  • 

A useful class of nonperfect byte-correcting codes is available when p
is a (Mersenne) prime of the form $2^d - 1$. Corresponding results for
nonprimes of this form are given in the next section. Residues of this
form are called "low-cost" by Avizienis [1] because of the relative
simplicity of implementation.

Lemma 8: Given $r = 2^b$ and prime $p = 2^d - 1$, p > r, it follows
that condition (1) is satisfied.

Theorem 9: For $r = 2^b$ and prime $p = 2^d - 1$, p > r,

$$M_r(A,3) = \frac{r^d - 1}{A}. \tag{10}$$

Theorem 9 follows from Lemma 8 and Theorem 1, with $e_r(p) = d$. The
resulting codes correspond to the single-bit error correcting biresidue
codes [1,19] with the given residues r-1 and p (prime), which are thus
seen to be byte correcting as well as bit correcting.

V EXTENSION TO NONPRIMES p

The foregoing theorems all assume that p is a prime. However,
byte-correcting codes in fact exist for many nonprimes p, although none
is near-perfect. An example is r = 8, p = 11x13,
$AM_8(A,3) = 2^{60} - 1$, which arises from an extension of Theorem 1.
For low-cost residues, Theorem 9 is generalized below (with an
additional condition) to certain nonprimes $p = 2^d - 1$. For this
case, nontrivial codes exist for every d > 3 other than 4 and 6, for at
least some $r = 2^b \ge 4$.


Theorem 10: With $r = 2^b$ and $p = 2^d - 1$ (d > b, with p not
necessarily prime), with gcd(r-1, p) = 1, and with A = (r-1)p, let f be
the largest integer $1 \le f < d$ for which $2^f - 1$ (also not
necessarily prime) is a divisor of p. Then

$$M_r(A,3) = \frac{r^d - 1}{A} \quad \text{iff} \quad r < \frac{2^d - 1}{2^f - 1}.$$

A simple example of such a byte-correcting code has r-1 = 3,
$p = 49 = 7^2$ (a base 7 residue calculation!), with
$AM_4(A,3) = 2^{42} - 1$. This code is close to near-perfect (cf.
Theorem 42 of [23]). (It is related to the more redundant code with
p = 7x127.) Such codes include certain of the multi-residue codes
[1,21], including not just those with prime residues $2^{d_i} - 1$
($f_i = 1$), but also some with nonprimes. A simple example of the
latter type has r-1 = 7, p = 31x255, for which
$AM_8(A,3) = 2^{120} - 1$. (Note that the code with r-1 = 7,
p » 13-31 has AM (A, 3) • 3*230 * 1, although the triresidue code has
AM (A, 3) > 2*° - 1 for the same A.) Thus the greedy algorithm of
simply trying multi-residue codes does chew off various Mersennary
codes that are byte correcting.
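The condition of Theorem 10 can be tabulated directly. The sketch below
assumes the condition reads r < (2^d - 1)/(2^f - 1), with f the largest
integer below d for which 2^f - 1 divides 2^d - 1, together with the
stated requirements d > b and gcd(r-1, p) = 1. The function names are
hypothetical.

```python
from math import gcd


def largest_f(d):
    """Largest f, 1 <= f < d, such that 2**f - 1 divides 2**d - 1 (d >= 2)."""
    p = 2 ** d - 1
    return max(f for f in range(1, d) if p % (2 ** f - 1) == 0)


def theorem10_applies(b, d):
    """True if r = 2**b, p = 2**d - 1 meet the assumed hypotheses of Theorem 10."""
    r, p = 2 ** b, 2 ** d - 1
    if not (d > b and gcd(r - 1, p) == 1):
        return False
    return r < p // (2 ** largest_f(d) - 1)


# p = 255: byte correcting for r = 8 but not for r = 32 or r = 128
print([theorem10_applies(b, 8) for b in (3, 5, 7)])       # [True, False, False]
# p = 2047 = 23 x 89: byte correcting for every 1 < b < 11
print(all(theorem10_applies(b, 11) for b in range(2, 11)))  # True
# d = 4 and d = 6 admit no radix r = 2**b >= 4
print([any(theorem10_applies(b, d) for b in range(2, d)) for d in (4, 6)])  # [False, False]
```

When the condition holds, the resulting word length is n = bd bits,
since $AM_r(A,3) + 1 = r^d = 2^{bd}$.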

VI SOME POTENTIALLY USEFUL EXAMPLES

Arithmetic coding is of interest for words of length up to about 64 (or
possibly 128 for applications such as double precision and
multiplication). Table I illustrates some byte-correcting codes for
r = 4, 8 and 16, and for r = 10. Values of p, $AM_r(A,3)$, n,
$\rho_A$ and $\rho_B$ are given in the table, with the following
meaning. The AN codes for A = (r-1)p can be used to encode up to
$M_r(A,3)$ code words; $\rho_A$ is the effective bit redundancy
required by A. The given value of n is such that $2^n$ is the largest
power of two contained in $AM_r(A,3) + 1$. Thus $n - \rho_A$ is the
effective number of binary information digits in the AN code.

On the other hand, the bi-residue codes with residues r-1 and p can be
used to encode up to $AM_r(A,3)$ code words; $\rho_B$ is the bit
redundancy required by these codes. The given value of n is thus the
effective number of binary information digits in the bi-residue code.
If syndromes are computed in bi-residue form in both cases, then
corresponding byte errors have identical syndromes. (Note that when
only one value of $\rho_A$ and $\rho_B$ is given, it is the value of
both.) The results also apply directly to the systematic gAN codes
[9,20], providing a permuted subcode of the AN code with
$2^{n-\rho_A}$ code words.

As a consequence of Theorem 10, some but not all single-bit error
correcting biresidue codes with residues $r-1 = 2^b - 1$ and
$p = 2^d - 1$ are in fact also byte correcting, above and beyond those
covered by Theorem 9. For p = 255, for example, the code with r = 8 is
byte correcting, while the codes with r = 32 and 128 are not (unless
truncated to about half their length). For p = 2047 = 23x89, the codes
for all $r = 2^b$, 1 < b < 11, are byte correcting.

If p is generalized to

$$p = \prod_{i=1}^{m} (2^{d_i} - 1), \tag{11}$$

some additional simply implementable and more efficient byte-correcting
codes arise, with each $d_i > b$, with pairwise gcd's all one among the
$d_i$'s and b, where each value of $2^{d_i} - 1$ satisfies
$r < (2^{d_i} - 1)/(2^{f_i} - 1)$, where $2^{f_i} - 1$ is the largest
such divisor of $2^{d_i} - 1$, $f_i < d_i$.

Near-perfect codes in the table (derived from Theorems 2 and 5) are
indicated with asterisks. The remaining codes in the table are all
derived from Theorem 7, with the exception of those with r = 8, p = 17,
and with r = 16, p = 31 (which arise from Theorems 6 and 9,
respectively), and that with r = 16, p = 73 (which arises from Theorem
1, but which is closely related to a Theorem 10 code with p = 511,
$\rho$ = 13). Table II summarizes a few selected codes with p given by
(11). Since near-perfect codes are seen to be fairly sparse for
$r = 2^b \ge 8$ and reasonable n, the codes of the form of (11) are
often competitive in terms of redundancy, besides having implementation
advantages.


    p                   Values of n for
                    r = 4    r = 8    r = 16

    511               18       --       36
    2047              22       33       44
    2^17 - 1          34       51       68
    7^2; 7x127        42       --       --
    31x127            70      105      140

Table II. Examples of single-byte correcting arithmetic codes with
simple syndrome generation.

Table I. Summary of byte-correcting arithmetic codes for various
radices and small primes p.


VII ERROR LOCATION

In practice there may be no need for error correction (apart from
real-time criticalities) if the faulty byte processor can be
immediately located and replaced by a spare. Alternatively this byte
processor could be removed, with computation continuing either with
degraded precision or with multiple precision operations. (Instruction
retry without replacement may of course be adequate if the fault is
transient or intermittent.) Thus the use of error-locating arithmetic
codes which specify the byte processor in error might appear to be very
desirable. Unfortunately (with exceptions noted below) almost all
linear byte-error locating codes are also byte-error correcting. This
follows from the linearity of the syndrome generation for errors within
a byte position, which errors therefore have distinct syndromes. Of
course, error-correcting codes may be used as error-locating codes.
(Partial error location is discussed in [1,4].)

Error-locating codes that are not error correcting do in fact exist:
outright duplication and comparison has this property since the lowest
byte position exhibiting a discrepancy is the position in error,
assuming that the arithmetic error was confined to a single byte.

(Note that duplication of n-bit words can be considered as an AN code
in which $A = 2^n + 1$.)

VIII MULTIPLE BYTE-ERROR DETECTION AND CORRECTION

AN codes for r = 2 are known that are capable of detecting double
errors while correcting single errors (distance 4, e.g., for A = 43),
or of correcting double adjacent errors (e.g., for A = 41); see [17].
Similar codes also exist for r > 2, along with corresponding
multi-residue codes. As an example, consider r = 4, p = 109. Using the
residues 3 and 109 over 2-bit bytes results in an AN code with
single-byte error correction plus double-byte error detection with
$M_4(327,4) = 9$.

The corresponding bi-residue code has up to AM = 2943 code words, or at
least 11 bits of information with 9 bits of redundancy.

IX  SOME  IMPLEMENTATION  CONSIDERATIONS 

The codes discussed here offer considerable flexibility and
effectiveness in use with byte-sliced arithmetic units [7,24],
permitting replacement of faulty byte processors. The cost of reliably
switching the spares does not seem to be excessive (e.g., [19]).
However, a careful comparison remains to be made with alternative
schemes involving replication, comparison, and diagnosis, under various
system assumptions. In any event the total system effect must be
considered in time and in equipment complexity. Results of Rao [19] for
r = 2 seem to indicate that about a 100% increase in equipment (i.e.,
effectively equivalent to duplication of the arithmetic unit) suffices
to provide byte-error correction.

Various arguments concerning the implementation of arithmetic codes are
also relevant here (cf. [1,2,4,7,18-22]). In general the effectiveness
of arithmetic coding using arbitrary residues p other than of the form
(11) rests heavily on the effectiveness of the residue calculations,
possibly even using analog techniques [6]. The use of low-cost residues
p of the form $2^d - 1$, or more generally of the generalized low-cost
residues (11), simplifies syndrome calculation. Note however that
various tricks may also be useful. The residue modulo 49 is not needed
in the code with r=4 and p=49 unless an error has actually occurred;
single-byte errors are completely and rapidly detectable by use of the
residue 7 alone. For the code with r=16 and p=73, residues modulo 73
may be derived from residues modulo 511, since 511=7x73.
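The simplicity of low-cost residues comes from the fact that $2^d$ is
congruent to 1 modulo $2^d - 1$, so a residue can be formed by adding
d-bit fields with end-around carry. The following is a minimal software
sketch of that idea for nonnegative integers; the function name
`low_cost_residue` is hypothetical, and a hardware realization would of
course differ.

```python
def low_cost_residue(x, d):
    """Residue of a nonnegative integer x modulo 2**d - 1 by folding d-bit fields."""
    m = (1 << d) - 1
    while x > m:
        x = (x & m) + (x >> d)      # 2**d == 1 (mod m), so the fold keeps the residue
    return 0 if x == m else x       # the all-ones field also represents zero


x = 123456
print(low_cost_residue(x, 9) == x % 511)         # True
print(low_cost_residue(x, 9) % 73 == x % 73)     # True, since 511 = 7 x 73
```

The second line illustrates the remark above: a residue modulo 73 can
be obtained from the low-cost residue modulo 511.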

Byte-correcting arithmetic codes also provide byte-error correction
when used in memory, e.g., in a byte-organized memory [10]. As seen
below, some of these codes have redundancy very close to the best
comparable byte-error correcting codes for memory [11] given by Hong
and Patel. Such codes thus have potential for dual use both in memory
(for error correction) and in arithmetic (for error detection at least,
if not for error correction). Advantages of such dual use are similar
to those in the JPL-STAR [3], which uses a (modified) residue 15
error-detecting code. See also [16].

For comparison purposes, the redundancies of byte-correcting codes of
various types are summarized in Table III. Included are the
byte-correcting Hong-Patel codes for memory, denoted by "M" in the
table, and the following arithmetic codes: the AN and gAN codes
(denoted by "A"), (multi-)residue codes with arbitrary residues
(denoted by "R"), (multi-)residue codes with generalized "low-cost"
residues of the form (11) (denoted by "G"), and those (multi-)residue
codes with low-cost residues of the form $2^d - 1$ (denoted by "L").

The near-perfect AN codes (when they exist) have redundancy

$$\log_2 [(r-1)p],$$

at most one bit more than that of the byte-correcting Hong-Patel codes
for memory [11], which require a redundancy of at least the larger of
2b and

$$\log_2 [(r-1)(n/b) + 1] = \log_2 [(r-1)(p-1)/2 + 1].$$

(The Hong-Patel redundancy is actually equal to this latter number in
many cases.)

Arithmetic codes need not be near-perfect to be close in redundancy to
the Hong-Patel memory codes. Several examples of such potential
dual-use arithmetic codes are worth noting. One case is that with r=4
and with k from 37 to 42. Here 7 bits of redundancy are required for
byte correction in memory [11], while 8 bits suffice for several forms
of arithmetic byte correction (A, R, G), e.g., the AN codes with A=3x71
and with A=3x79, and the generalized low-cost residue code with p=49.
(Note that the Hamming code for single-bit error correction requires 6
bits of redundancy.) Other examples with this one-bit-extra property
exist for b=3 with k from 45 to 62 (with 8 bits of redundancy for M, 9
bits for A and R); for b=6 with k up to 42 (with residues 63 and 127
giving $\rho$=13, instead of 12 for byte correction in memory alone);
and for b=10 with k up to 110 (with residues 1023 and 2047 giving
$\rho$=21, instead of 20). In many cases, however, the arithmetic code
redundancy is significantly greater than the memory code redundancy. In
such cases the arithmetic codes may not be suitable for dual use,
although they may still be applicable for arithmetic alone.


X CONCLUSIONS

The codes presented here have considerable potential in the realization
of cost-effective fault-tolerant computing systems capable of high
availability. They contribute a new approach to the design of
byte-sliced arithmetic processors.

1-bit bytes

    k      M    A    R    G    L

    24     5    6    7    8   10
    32     6    7    8    8   12
    48     6    7    8   12   12
    64     7    8    9   12   14

In passing, it is worth noting two oddities for p=31 (a low-cost
residue for $r = 2^b$), namely the near-perfect codes with radices r=7
and r=10. The code for r=10 could be quite effective in a binary-coded
decimal machine. It is also interesting to observe that, due to
irregularities in the existence of good codes with r at least 4, the
redundancies of the residue codes are occasionally less than those of
the comparable AN codes. Several such examples exist in Table III.


2-bit bytes

    k      M    A    R    G    L

    24     6    8    8    8   10
    32     6    8    8    8   12
    48     7    8    9   12   14
    64     7    8    9   12   14

A source of complexity arises when a truncated code is used, e.g., a
code of Theorem 7. The implicit truncation leads to the need for an
internal overflow correction, requiring some increase in circuitry.

The systematic multi-residue codes and the gAN codes have advantages
over the AN codes due to the visibility of their information digits.
The multi-residue codes have the disadvantage that the check digits are
not directly protected by the code as they are in the AN and gAN codes.
Detailed comparison of these approaches is desirable for byte
correction. However, the results here apply to all these types of
arithmetic codes. Another approach to error correction of iterative
errors resulting from a single fault in high-speed arithmetic is found
in [7]. Further discussion of systems aspects is found in [15].
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M 

"T3 
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>it  by  ten 

R  C 
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16 

24 

k 

8 

10 

11 

9 

10 

11 

11 

8 

11 

11 

13 

13 

48 

8 

12 

12 

14 

16 

1  64 

9 

12 

12 

14 

16 

Table III. Redundancy for byte correcting codes:

M = Memory error correcting
A = AN, gAN arithmetic error correcting
R = Multiresidue arithmetic error correcting
G = R with generalized low-cost residues only
L = R with low-cost residues only


APPENDIX  PROOFS  OF  THF  RESULTS 


Proof of Theorem 1: We prove (2) using the concept that
an AN code (with arithmetic distance at least 3 in radix
$r$) is single-byte error correcting if distinct errors
have distinct syndromes.* Consider two distinct byte
errors $E = \alpha r^j$ and $E' = \beta r^i$, with $0 < |\alpha| < r$,
$0 < |\beta| < r$, for $j$ and $i$ among $0, 1, \ldots, e_r(p) - 1$,
$j \ne i$. Suppose their syndromes are equal, which is to
be proven impossible. (Assume $j > i$, without loss of
generality.) Then,

$$\alpha r^j \equiv \beta r^i \pmod{A}, \quad \text{and} \quad \alpha r^{j-i} \equiv \beta \pmod{A},$$

i.e.,

$$\alpha r^t \equiv \beta \pmod{A}, \tag{12}$$

for $t = j - i$, some integer satisfying $1 \le t \le e_r(p) - 1$. It
follows from (12) that for this $t$

$$\alpha r^t \equiv \beta \pmod{r-1} \tag{13}$$

and

$$\alpha r^t \equiv \beta \pmod{p}. \tag{14}$$

From (13) we have that $\alpha \equiv \beta \pmod{r-1}$, since $r^t \equiv r \equiv 1 \pmod{r-1}$.
There are two cases: if $\alpha$ and $\beta$ have the
same sign, then they are equal. When they are of opposite
sign, choose $\alpha$ to be positive without loss of
generality, whence

$$\alpha = \beta + (r-1). \tag{15}$$

When $\alpha = \beta$, (14) cannot be satisfied for any $t$, $0 < t < e_r(p)$,
whence the syndromes of all such distinct
errors must be distinct, as is to be proved. When
$\alpha \ne \beta$, substitute (15) in (14), whence
$\alpha r^t \equiv \alpha - (r-1) \pmod{p}$, and $(\alpha - r + 1)\,\alpha^{-1} \equiv r^t \pmod{p}$,
which implies that (by definition)

$$(\alpha - r + 1)\,\alpha^{-1} \in G_r(p).$$

This is a contradiction of the hypothesis (1), implying
that the two syndromes must be distinct. Finally, if
$p-1$ is not in $G_r(p)$, then $r^{e_r(p)} \equiv 1 \pmod{p}$, the
smallest positive power of $r$ having this property. Thus
the code can include all code words AN up to (but not
including) $r^{e_r(p)} - 1$, which has arithmetic weight two. □
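
The syndrome-distinctness argument above can be checked by brute force on small cases. The following is a minimal sketch, not part of the original report; it assumes A = (r-1)p as used in the proof, and the function names are illustrative. It enumerates every single-byte error alpha * r**j with 0 < |alpha| < r and 0 <= j < e_r(p), and tests whether their syndromes modulo A are nonzero and pairwise distinct.

```python
def multiplicative_order(r: int, p: int) -> int:
    """Smallest e > 0 with r**e == 1 (mod p); assumes gcd(r, p) == 1."""
    e, x = 1, r % p
    while x != 1:
        x = (x * r) % p
        e += 1
    return e


def single_byte_syndromes_distinct(r: int, p: int) -> bool:
    """Check that all errors alpha * r**j (0 < |alpha| < r, 0 <= j < e_r(p))
    have nonzero, pairwise distinct syndromes modulo A = (r - 1) * p."""
    A = (r - 1) * p  # assumed generator form, as in the proof above
    seen = set()
    for j in range(multiplicative_order(r, p)):
        for alpha in range(-(r - 1), r):
            if alpha == 0:
                continue
            syndrome = (alpha * r**j) % A
            if syndrome == 0 or syndrome in seen:
                return False
            seen.add(syndrome)
    return True


if __name__ == "__main__":
    for r, p in [(2, 7), (3, 11), (4, 7), (2, 23), (2, 5)]:
        print(r, p, single_byte_syndromes_distinct(r, p))
```

In the small cases (r, p) = (2, 7), (3, 11), (4, 7) and (2, 23), the value p-1 is not a power of r modulo p and the check succeeds. For (2, 5), p-1 = 4 is a power of 2 modulo 5, so a positive and a negative error collide and the check fails over the full range of j.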


Proof of Corollary 3: In Theorem 2, set $r = 2$, whence
$A = p$. We observe that the open interval $(0, r-1)$ is
empty, and therefore (1) is trivially satisfied. Thus
(4) follows from (3) in Theorem 2. □


Proof of Corollary 4: Set $r = 3$ in Theorem 2. In the
open interval $(0, r-1)$ there is only one integer and
that is $\alpha = 1$. For that case $(\alpha - r + 1)\,\alpha^{-1} = -1$. Recall
that $-1$ does not exist in $G_3(p)$ (because of the
nonprimitivity of $r = 3$; see the text preceding Theorem 2).
Therefore the condition (1) is satisfied, and (5) follows
from (3) in Theorem 2. □


Proof of Theorem 5: Set $r = 4$ in Theorem 2. In the
open interval $(0, r-1)$ exist only $\alpha = 1$ and $\alpha = 2$, for
which $(\alpha - r + 1)\,\alpha^{-1} = -(3 - \alpha)\,\alpha^{-1}$ is $-2$ and
$(-1)(2^{-1}) \equiv (p-1)/2$, respectively. Since $-1$ is not in
$G_4(p)$ (because of the nonprimitivity of $r = 4$, as above),
and because $G_4(p)$ is identical to $G_2(p)$ in this case
(whence $-2$ but not $+2$ is primitive), $-2$ cannot be in
$G_4(p)$. For the same reason, $(p-1)/2$ cannot be. (If it
were, then $-1$ would be.) Thus condition (1) holds. □


Proof of Theorem 6: The proof of syndrome uniqueness is
identical to that of Theorem 1, except that the ranges
of $i$, $j$ and $t$ are up to $e_r(p)/2 - 1$. This is to avoid
ambiguity between positively and negatively signed errors
since $r^{e_r(p)/2} \equiv -1 \pmod{p}$. Consequently the code can
include all code words AN up to but not including

$$(r-1)\,r^{e_r(p)/2} + (r-1) \quad \text{for even } r,$$

and

$$\frac{r-1}{2}\,r^{e_r(p)/2} + \frac{r-1}{2} \quad \text{for odd } r,$$

the smallest nonzero radix $r$ representations of arithmetic
weight two divisible by $(r-1)p$. □
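
As a small numerical illustration of the two bounds in the proof of Theorem 6, the sketch below computes the stated value for a given r and p, so that its divisibility by (r-1)p can be checked. It is a hedged example rather than the authors' procedure: the function names are illustrative, and it only applies when e_r(p) is even with r**(e_r(p)/2) congruent to -1 modulo p. It does not attempt to verify that the value is the smallest such multiple.

```python
def multiplicative_order(r: int, p: int) -> int:
    """Smallest e > 0 with r**e == 1 (mod p); assumes gcd(r, p) == 1."""
    e, x = 1, r % p
    while x != 1:
        x = (x * r) % p
        e += 1
    return e


def theorem6_bound(r: int, p: int) -> int:
    """Weight-two value c * r**(e/2) + c from the proof of Theorem 6, where
    c = r - 1 for even r and c = (r - 1) / 2 for odd r."""
    e = multiplicative_order(r, p)
    if e % 2 != 0 or pow(r, e // 2, p) != p - 1:
        raise ValueError("-1 is not a power of r modulo p for these parameters")
    c = (r - 1) if r % 2 == 0 else (r - 1) // 2
    return c * r ** (e // 2) + c


if __name__ == "__main__":
    for r, p in [(2, 5), (3, 5), (4, 5), (2, 17), (3, 7)]:
        bound = theorem6_bound(r, p)
        print(r, p, bound, bound % ((r - 1) * p) == 0)
```

For example, with r = 3 and p = 5 the order e_r(p) is 4, the bound is 1 * 9 + 1 = 10, and (r-1)p = 10 divides it.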

Proof of Lemma 8: We observe that since $p$ is (by definition)
a prime, $d$ must be a prime (Cataldi-Fermat,
see [23], p. 3). Thus the elements of $G_r(p)$ are precisely
the first $d$ consecutive powers of 2, since $G_r(p)$ is here
identical to $G_2(p)$, since $\gcd(b, d) = 1$. Therefore we
need only prove that $\alpha \cdot 2^t \equiv \alpha - r + 1 \pmod{2^d - 1}$ cannot
occur. Assume that it can. Then

$$\alpha \cdot 2^t \equiv 2^d - 2^b + \alpha \pmod{2^d - 1}$$

and

$$\alpha \cdot 2^t \equiv 2^b\,(2^{d-b} - 1) + \alpha \pmod{2^d - 1}. \tag{16}$$

We note that on the righthand side of the congruence (16)
we have an integer less than $2^d - 1$ whose binary
representation has two parts, the higher order part of value
$2^d - 2^b$ and the lower order part ($b$ digits) of value $\alpha$.
Also we note the Hamming weight of this integer (the
number of ones in the binary representation) must be at
least one greater than the Hamming weight of $\alpha$. On the
other hand, the Hamming weight of $\alpha \cdot 2^t \pmod{2^d - 1}$ is the
same as the Hamming weight of $\alpha$, for the reason that the
multiplication by a power of 2 modulo $2^d - 1$ is in
effect a cyclic shift of $\alpha$ by $t$ places and the Hamming
weight is invariant under cyclic shifts. Therefore
the congruence (16) cannot hold. □
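
The cyclic-shift argument in the proof above lends itself to a direct numerical check. The sketch below is illustrative only, with hypothetical function names, and assumes r = 2**b with b < d. It confirms for small parameters that multiplying alpha by 2**t modulo 2**d - 1 equals a d-bit left rotation, that the Hamming weight is preserved, and that the congruence (16) therefore never holds for 0 < alpha < r - 1.

```python
def hamming_weight(x: int) -> int:
    """Number of ones in the binary representation of x."""
    return bin(x).count("1")


def rotate_left(a: int, t: int, d: int) -> int:
    """Cyclically shift the d-bit value a left by t places (0 <= t < d)."""
    mask = (1 << d) - 1
    return ((a << t) | (a >> (d - t))) & mask


def congruence_16_possible(b: int, d: int) -> bool:
    """Return True if alpha * 2**t == 2**d - 2**b + alpha (mod 2**d - 1)
    for some 0 < alpha < 2**b - 1 and 0 <= t < d; assumes b < d."""
    m = (1 << d) - 1
    for alpha in range(1, (1 << b) - 1):
        rhs = (1 << d) - (1 << b) + alpha  # already less than 2**d - 1
        for t in range(d):
            lhs = (alpha << t) % m
            # Multiplication by 2**t modulo 2**d - 1 is a cyclic shift,
            # so the Hamming weight of alpha is unchanged.
            assert lhs == rotate_left(alpha, t, d)
            assert hamming_weight(lhs) == hamming_weight(alpha)
            if lhs == rhs:
                return True
    return False


if __name__ == "__main__":
    for b, d in [(2, 3), (2, 5), (3, 7)]:
        print(b, d, congruence_16_possible(b, d))
```

The right-hand side always carries the d - b extra ones of the high-order part, which is why the search finds no solution in these cases.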


Proof of Theorem 9: From Lemma 8, condition (1) of
Theorem 1 is satisfied. Further, the elements of $G_r(p)$
are of the form $2^k$ ($k = 0, 1, \ldots, d-1$), and $p-1$ clearly is
not in $G_r(p)$. Also, since the gcd of $b$ and $d$ is 1, we
have $e_r(2^d - 1) = d$. Thus (10) follows from Theorem 1. □

Proof of Theorem 10: If $r >$ [...], then the two errors
[...] $r^f$ and [...] $r^0$ have identical syndromes,
violating error correction. This follows simply from
$r^f - 1 \equiv 0 \pmod{r-1}$ and $r^f - 1 = (2^f)^b - 1 \equiv 0 \pmod{2^f - 1}$,
whence $A = (r-1)p = (r-1)(2^f - 1)$, which divides the
difference of the errors.

[...] If $r$ [...], then these errors cannot
arise as single byte errors. It is readily seen that
$2^f - 1$ cannot divide $\alpha(r^t - 1)$ in any other way for $1 \le t <$ [...],
and thus all single-byte error syndromes are distinct. □


*(The error $\pm r^{e_r(p)}$ is naturally excluded. Massey [13]
shows that otherwise, and in general for other than
single errors, it is not necessary for all syndromes
to be distinct for correction of a given set of errors.)